Apache Lens at Hadoop meetup

32
Apache Lens Unified analytics platform Sharad Agarwal Amareshwari Sriramadasu

Transcript of Apache Lens at Hadoop meetup

Apache LensUnified analytics platform

Sharad Agarwal

Amareshwari Sriramadasu

Data warehouse evolution at InMobi

Introduction to APACHE LENS

LENS Architecture and OLAP Model

Query examples and Demo

Lens roadmap

Agenda

Reporting warehouse : Generation 1

Reporting data in RDBMS

Aggregations/materialized views in DB

~ 1.5 TB

Generation 1: RDBMS

Challenges:

Data Scale: Loading of data taking ~ 24 hrs

Analysis only upto 3 dimensions

Heavy queries stalling other user queries

Unable to move fast with new reporting requirments

Generation 1 : Reporting data in RDBMS

Reporting warehouse : Generation 2

Small and summarized data in Columnar Database

- Rich Dashboards

Granular data in Hadoop

- Adhoc analysis

Generation 2: Hadoop + Columnar DB

Grown to:

Hadoop ~ 250 TB

Columnar DB ~ 8 TB

Generation 2: Hadoop + Columnar DB

Challenges:

Maintaining two lines of independent data warehousing systems

Data discrepancies

Schema management

Learning curve for Users

Duplicate datasets

Inefficient Utilization

Proprietary OLAP and MR on Hadoop

Generation 2: Hadoop + Columnar DB

Reporting warehouse : Generation 3

Platform to enable multi-dimensional queries in a unified way over

datasets stored in multiple warehouses

- OLAP Cube abstraction

- Data discovery by providing single metadata layer

- Unified access to data by integrating Hive with other traditional

warehouses

Generation 3: Apache Lens (earlier called Grill)

- Queries get pushed to where data resides

- Central Catalog management: All applications talk same language

- Query analytics for optimizing hot datasets

- Workload based experimentation with newer systems: AWS

Redshift, Apache Spark, Apache Tez

Generation 3: Apache Lens

Machine learning workflow in Apache Lens using

Apache Spark

Generation 4 (Future) : Advanced Analytics

Problem areas and Motivation

Introduction to APACHE LENS

LENS Architecture and OLAP Model

Query examples and Demo

Lens roadmap

Agenda

Lens Architecture

Data Layout – Fact data

Dim-cuts, Measures

Cost

Data Layout – Dimension data

Subsetm (am < am-1)

Subset2 (a2 < a1)

Subset1 (a1 < ax)

All attributes (ax)

Cost

Data Layout – Snowflake

Aggr Factk : measures (mak <=

ma(k-1)), dims (dak < da(k-1))

…..

Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw Fact: measures (mr), dimensions(dr)

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

Demo

Data Model

Fact1: d1,m2,m3,m4 (HDFS)

Fact2: d1,d3,m2(sum),m3(max),m4(mi

n) (HDFS)

Raw :d1,d2,d3,m1,m2,m3,m4 (HDFS)

Dim_table2 : id, name,

detail2 (HDFS)

Dim_table3 : id, name,detail,d2id

(DB)

Dim_table4 : id, name, d2id (HDFS, DB)

Dim_table : id, name, detail, d2id (HDFS)

SAMPLE_CUBE

SAMPLE_DIM2 SAMPLE_DIM

SAMPLE_DB_DIM

d3->id

d2id->id

d2id->id

Apache Lens : Features

Server

Query life cycle management

Catalog service

Query statistics

Scheduling queries

Authorization

Query caching

Estimate expected query time

Client

Java Client

CLI

JDBC Client

Simple UI

Execution Engine

Hive Driver

JDBC Driver

Spark Driver

Github source for Apache Lens

• https://git-wip-us.apache.org/repos/asf/incubator-lens.git

• https://github.com/apache/incubator-lens

Documentation

• Soon :http://lens.incubator.apache.org

• Right now : http://inmobi.github.io/grill/

Mailing lists

[email protected]

[email protected]

References

Thank you!

Backup

• SELECT ( city.name ), ( city.stateid ) FROM c2_citytable city LIMIT 100

• SELECT ( city.name ), ( city.stateid ) FROM c1_citytable city WHERE (city.dt = 'latest') LIMIT 100

cube select name, stateid from city limit 100

Example query

Example query

• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable city ON ((testcube.cityid)= (city.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (city.dt = 'latest')

GROUP BY(city.name)

cube select city.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)

Data Model – Storage

Sto

rage • Name

• End point

• Properties

• Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2

Data Model – Fact Table

Fact table

Cube

Fact table

StorageFact Table • Columns

• Cube that it belongs

• Storages on which it is present and the associated update periods

Data Model – Dimension table

Dim

ensio

n T

able • Columns

• Dimension to which it belongs

• Storages on which it is present and associated snapshot dump period, if any.

Dimension Table

Dimension

Dimension table

Storage

Data Model – Storage tables and partitionsS

tora

ge t

ab

le • Belongs to fact/dimension

• Associated storage descriptor

• Partitioned by columns

• Naming convention – storage name followed by fact/dimension name

• Partition can override its storage descriptor

• Fact storage table

Fact table

• Dimension storage table

Dimension table

Resolve candidate tables and storages

Automatically resolve joins, aggregations

Allows SQL over Cube QL

Queries can span multiple storages

Accepts multiple time ranges in query

All Hive QL features

Query features

Adhoc querying system Internal Dashboards

Customer facing Dashboards and Reporting

Analytics systems at Inmobi