Grill at HadoopSummit

42
Grill HQL over Tiered Data Warehouse Amareshwari Sriramadasu Suma Shivaprasad

description

Unified analytics platform which unifies both HDFS and traditional DWH

Transcript of Grill at HadoopSummit

Page 1: Grill at HadoopSummit

GrillHQL over Tiered Data

WarehouseAmareshwari Sriramadasu

Suma Shivaprasad

Page 2: Grill at HadoopSummit

Amareshwari

Sriramadasu

•Engineer at Inmobi

•Working in Hadoop and eco systems since 2007

•Apache Hadoop – PMC

•Apache Hive– Committer

Page 3: Grill at HadoopSummit

SumaShivaprasad

•Engineer at Inmobi

•Working in Hadoop and eco systems since 2011

Page 4: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 5: Grill at HadoopSummit

Digital advertising at Inmobi

Courtesy: http://www.liesdamnedlies.com/

Owns & Sells Real estate on digital inventory

Has reach to users

Wants to target Users

Brings money

Market place

Consumer

Page 6: Grill at HadoopSummit

Analytics Use cases

• Understanding Trends & Inference

• Forecasting and Anamoly detection

Data scientists

• Feedback to improve Ad Relevance in Real Time

Engineering systems

• Troubleshooting of issues

Developers

• Publisher/Advertiser specific analytics(dashboards)

Advertisers and publishers

• Tracking metrics

Account managers/Executive team

• Inventory sizing and estimation

Business/Product analysts

Page 7: Grill at HadoopSummit

• Canned/Dashboard queries

• Adhoc queries

• Interactive/Batch queries

• Scheduled queries

• Infer insights through ML algorithms

Categorize the use cases

Page 8: Grill at HadoopSummit

Adhoc querying system Internal Dashboards

Customer facing Dashboards and Reporting

Analytics systems at Inmobi

Page 9: Grill at HadoopSummit

Analytics Warehouses at Inmobi and Scale

• Billions of Ad Requests/Impressions per day

• 170 TB Hadoop Warehouse• 5 TB SQL Columnar Datawarehouse• 70 TB Hbase cluster• DBMS• Spark/Shark in near future

Page 10: Grill at HadoopSummit

Why both Hadoop and SQL warehouse?

Canned

Adhoc

Response Times

IO (Input Records)

Adhoc

Canned Adhoc

Query Engine

Query

Dashboard queries are mostly canned and Interactive

Adhoc queries can be Interactive or batch depending on the data volumes and query complexity

Page 11: Grill at HadoopSummit

•Disparate user experience

•Disparate data storage and execution engines

•Schema management across storages

•Data discovery

•Not leveraging ‘SQL on Hadoop’ community

Problems

Page 12: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 13: Grill at HadoopSummit

Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community

Data warehouse features like cubes, facts, dimensionsLogical table associated with multiple physical storagesPluggable execution engineQuery lifecycle managementQuery quota managementScheduling queries

Wh

at

does

Hiv

e

pro

vid

e

Wh

at is m

issing in

H

ive

Apache Hive to the rescue

Page 14: Grill at HadoopSummit

Data Layout – OLAP Cube

Aggr Factk : measures (mak <=

ma(k-1)), dims (dak < da(k-1))

…..Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr Fact1: measures (ma1 <= mr), dimensions (da1 < dr)

Raw fact(mr), dimensions(dr)

TimeDimensions, Measures

Cost

Page 15: Grill at HadoopSummit

Data Layout – Snowflake

Aggr Factk : measures (mak <= ma(k-1)), dims (dak

< da(k-1))

…..

Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw Fact: measures (mr), dimensions(dr)

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

Page 16: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 17: Grill at HadoopSummit

Data Model

Cube Storage Fact Table

Physical Fact tables

Dimension Table

Physical Dimension

tables

Page 18: Grill at HadoopSummit

Data Model - Cube

Dimension• Simple Dimension• Referenced Dimension• Hierarchical Dimension• Expression Dimension• Timed dimension

Measure• Column Measure • Expression Measure

Cube

Measures Dimensions

Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/

Page 19: Grill at HadoopSummit

Data Model – Storage

Sto

rag

e • Name• End point• Properties• Ex : ProdCluster, StagingCluster,

Postgres1, HBase1, HBase2

Page 20: Grill at HadoopSummit

Data Model – Fact Table

Fact table

Cube

Fact table

Storage

Fact

Ta

ble• Columns

• Cube that it belongs

• Storages on which it is present and the associated update periods

Page 21: Grill at HadoopSummit

Data Model – Dimension tableD

imen

sion

Ta

ble • Columns

• Dimension references

• Storages on which it is present and associated snapshot dump period, if any.

Cube

Dimension table

Dimension table

Dimension table

Storage

Page 22: Grill at HadoopSummit

Data Model – Storage tables and partitionsS

tora

ge

tab

le• Belongs to fact/dimension• Associated storage

descriptor• Partitioned by columns• Naming convention –

storage name followed by fact/dimension name

• Partition can override its storage descriptor

• Fact storage table

Fact table

• Dimension storage table

Dimension table

Page 23: Grill at HadoopSummit

Data Model

Aggrk Fact Table

…..

Aggr2 Fact Table

Aggr1 Fact Table

Raw FactTable

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

CUBE

Page 24: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 25: Grill at HadoopSummit

CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]

cube_table_reference:cube_table_factor

| join_tablejoin_table:

cube_table_reference JOIN cube_table_factor [join_condition]

| cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]

cube_table_factor:cube_name [alias]

| ( cube_table_reference )join_condition:

ON equality_expression ( AND equality_expression )*equality_expression:

expression = expressioncolOrder: ( ASC | DESC )

colList : colName colOrder? (',' colName colOrder?)*

Queries on OLAP cubes

Page 26: Grill at HadoopSummit

Resolve candidate tables and storages

Automatically resolve joins

Resolve aggregates and groupby expressions

Allows SQL over Cube QL

Queries can span multiple storages

Accepts multi time range queries

All Hive QL features

Query features

Page 27: Grill at HadoopSummit

• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100

• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100

cube select name, stateid from citytable limit 100

Example query

Page 28: Grill at HadoopSummit

Example query

• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest')

• GROUP BY(citytable.name)

cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)

Page 29: Grill at HadoopSummit

Available in Hive• Data warehouse

features like facts, dimensions

• Logical table associated with multiple physical storages

Available in Grill • Pluggable execution engine

for HQL• Query life cycle

management• Scheduling queries

Where is it available

Page 30: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 31: Grill at HadoopSummit

GRILL

Unify the Catalog and Query layer for Adhoc/CannedBatch/Interactive

Reports on single Interface

Page 32: Grill at HadoopSummit

Grill Architecture

Page 33: Grill at HadoopSummit

Implements an interface• explain• execute• executeAsynchronously• fetchResults• Specify all storages it can support

Pluggable execution engine

Page 34: Grill at HadoopSummit

OLAP Cube QL query

Rewrite query for available execution engine’s supported storages

Get cost of the rewritten query from each execution engine

Pick up execution engine with least cost and fire the query

Cube query with multiple execution engines

Page 35: Grill at HadoopSummit

Grill Server

Page 36: Grill at HadoopSummit

Grill – current state

Server

Query ServiceMetastore ServiceMetricsQuery statistics(In progress)Scheduled queries(In progress)Query caching(In progress)

Client

Java ClientCLIJDBC Client

Execution EngineHive DriverJDBC DriverImpala Driver

Page 37: Grill at HadoopSummit

• Normalize query cost• Load balancing across execution engines• Alter meta hooks in StorageHandler• Authentication and authorization• Machine learning through Grill• Query quota management

Grill roadmap

Page 38: Grill at HadoopSummit

• Number of queries - 700 to 900 per day

• Number of dimension tables - 125

• Number of fact tables – 30

• Number cubes – 22

• Size of the data

• Total size – 170 TB

• Dimension data – 400 MB compressed per hour

• Raw data - 1.2 TB per day

• Aggregated facts- 53GB per day

Data ware house statistics

Page 39: Grill at HadoopSummit

Jira reference• https://issues.apache.org/jira/browse/HIVE-4115

Github source for Hive• https://github.com/InMobi/hive

Github source for grill• https://github.com/InMobi/grill

Documentation• http://inmobi.github.io/grill

Mailing lists• [email protected][email protected]

References

Page 40: Grill at HadoopSummit

Analytics at Inmobi - Problem areas and Motivation

Why Apache Hive

OLAP Model in Apache Hive

Query examples

Grill – Unified analytics

Demo

Agenda

Page 41: Grill at HadoopSummit

Demo