Apache Kylin - OLAP Cubes for SQL on Hadoop

© 2014 MapR Technologies

Page 1: Apache Kylin - OLAP Cubes for SQL on Hadoop

Page 2: Apache Kylin - OLAP Cubes for SQL on Hadoop

Who I am

Ted Dunning, Chief Applications Architect, MapR Technologies

Email [email protected] [email protected]

Twitter @Ted_Dunning

VP Incubator

Email [email protected]

Twitter @ApacheMahout @ApacheDrill

Credit for slides to Luke Han and the Kylin dev team

Page 3: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin Committers

ankur Ankur Bansal

jiangxu Jiang Xu

liyang Li Yang

lukehan Luke Han*

mahongbin Hongbin Ma

xduo Xiaodong Duo

yisong George Song

jhyde Julian Hyde

The real deal

Calcite plenipotentiary

Page 4: Apache Kylin - OLAP Cubes for SQL on Hadoop

Agenda

• What is Apache Kylin?

• Features & Tech Highlights

• Performance

• Roadmap

• Q & A

Page 5: Apache Kylin - OLAP Cubes for SQL on Hadoop

What is Kylin?

Extreme OLAP Engine for Big Data

Kylin is an open source distributed analytics engine, originally from eBay, that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets

kylin / ˈkiːˈlɪn / 麒麟

--n. (in Chinese art) a mythical animal of composite form

• Open Sourced on Oct 1st, 2014

• Accepted into incubation November, 2014

• Preparing for first Apache release

Page 6: Apache Kylin - OLAP Cubes for SQL on Hadoop

Big Data Obligatory Slide

• More and more data becoming available on Hadoop

• Limitations in existing Business Intelligence (BI) Tools

– Limited support for Hadoop

– Data size growing exponentially

– High latency of interactive queries

– Scale-Up architecture

• Challenges in adopting Hadoop as an interactive analysis system

– Majority of analyst groups are SQL savvy

– No mature SQL interface on Hadoop

– OLAP capability on Hadoop ecosystem not ready yet

Page 7: Apache Kylin - OLAP Cubes for SQL on Hadoop

Goals

• Sub-second query latency on billions of rows

• ANSI SQL for both analysts and engineers

• Full OLAP capability to offer advanced functionality

• Seamless Integration with BI Tools

• Support for high cardinality and dimensionality

• High concurrency – thousands of end users

• Distributed and scale out architecture for large data volume

Page 8: Apache Kylin - OLAP Cubes for SQL on Hadoop

Possible Strategies

• Build from scratch

– A grand tradition

– Large-scale SQL support is much harder than it looks

– Huge level of distraction

• Patch Hive

– Not feasible due to design assumptions in Hive

– Weak optimizer

– Hive isn’t standard SQL anyway

– (but isn’t Hive moving to Calcite?)

Page 9: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin’s Strategy

• Use Calcite as SQL core

– Real SQL

– Real cost-based optimizer

– Already in Apache

– Provides linkage to Apache Drill and future of Hive

• Build cubes externally

– Don’t care which tools, currently Hive, soon Spark

• Use Calcite’s Rex interpreter

– Assumes final aggregations fit on one machine

• Possibly integrate with Drill at some point for parallel execution

Page 10: Apache Kylin - OLAP Cubes for SQL on Hadoop

Analytics Query Taxonomy

(Pyramid diagram; side labels: Strategy, Operation, Transaction, OLAP, OLTP.)

High Level Aggregation

• Very high level, e.g., GMV by site by vertical by week

Analysis Query

• Mid-level, e.g., GMV by site by vertical by category (level x), past 12 weeks

Drill Down to Detail

• Detail level (summary table)

Low Level Aggregation

• First-level aggregation

Transaction Level

• Transaction data

Kylin is designed to accelerate 80+% of analytics queries on Hadoop

Page 11: Apache Kylin - OLAP Cubes for SQL on Hadoop

Technical Challenges

• Huge data volume

– Table scan

• Big table joins

– Data shuffling

• Analysis on different granularity

– Runtime aggregation expensive

• MapReduce jobs

– Batch processing

Page 12: Apache Kylin - OLAP Cubes for SQL on Hadoop

How Cubes Work

• Start with a simple table

– revenue, time, item, location, supplier

• Build a table of aggregates for every combination of fields

select time, item, location, sum(revenue), max(revenue) from tbl group by time, item, location;

select time, item, sum(revenue), max(revenue) from tbl group by time, item;

select time, item, supplier, sum(revenue), max(revenue) from tbl group by time, item, supplier;

• Then transform queries using appropriate magic

select sum(revenue), city from tbl join location_details
where state = 'MN' group by city

becomes

select … from (select sum(), location from cube) join location_details
where state = 'MN' group by city
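
A minimal sketch of the "table of aggregates for every combination of fields" idea, in plain Python rather than SQL. The toy rows and the helper names `build_cube` and `sum_revenue_by` are hypothetical, chosen only for illustration. This is not how Kylin itself builds or stores cubes.

```python
from itertools import combinations

# Hypothetical toy fact table using the columns named on the slide.
ROWS = [
    {"time": "9/15", "item": "milk", "location": "Urbana", "supplier": "Dairy_land", "revenue": 10.0},
    {"time": "9/15", "item": "milk", "location": "Chicago", "supplier": "Dairy_land", "revenue": 7.0},
    {"time": "9/16", "item": "milk", "location": "Urbana", "supplier": "Farm_co", "revenue": 5.0},
    {"time": "9/16", "item": "eggs", "location": "Urbana", "supplier": "Farm_co", "revenue": 3.0},
]
DIMENSIONS = ("time", "item", "location", "supplier")


def build_cube(rows, dimensions):
    """Return {group-by columns: {key tuple: (sum, max)}} for every combination."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, size):
            table = {}
            for row in rows:
                key = tuple(row[d] for d in group_by)
                total, biggest = table.get(key, (0.0, float("-inf")))
                table[key] = (total + row["revenue"], max(biggest, row["revenue"]))
            cube[group_by] = table
    return cube


def sum_revenue_by(cube, group_by):
    """Answer a `sum(revenue) ... group by` query from the cube alone.

    `group_by` must list columns in the same order as DIMENSIONS.
    """
    return {key: total for key, (total, _) in cube[tuple(group_by)].items()}


cube = build_cube(ROWS, DIMENSIONS)
by_location = sum_revenue_by(cube, ["location"])
print(len(cube), by_location)
```

With four dimensions the cube holds 16 aggregate tables, and a query grouped by location never touches the original rows.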

Page 13: Apache Kylin - OLAP Cubes for SQL on Hadoop

How Cubes Don’t Work

• Total number of cubes is exponential in columns

• High cardinality can result in large cubes

• Skewed data can make cubes larger than the original data

• Magic may be insufficient to recognize cubable queries

• Keeping cubes up to date can be hard

• Forget OLTP thoughts like pervasive transactions

Page 14: Apache Kylin - OLAP Cubes for SQL on Hadoop

OLAP Cube – Balance between Space and Time

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>

2. (9/15, milk, Urbana, *) - <time, item, location>

3. (*, milk, Urbana, *) - <item, location>

4. (*, milk, Chicago, *) - <item, location>

5. (*, milk, *, *) - <item>

Cuboid = one combination of dimensions

Cube = all combinations of dimensions

1111

0111 1011 1101 1110

0011 0101 0110 1001 1010 1100

0001 0010 0100 1000

0000
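
The lattice of bit patterns above can be generated mechanically. The sketch below assumes that each bit marks whether a dimension is kept, and that the leftmost bit corresponds to the first dimension. That ordering is an assumption for illustration, not something the slide states. The function names are hypothetical.

```python
DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_columns(mask, dimensions=DIMENSIONS):
    """Decode a bitmask such as 0b1010 into the dimensions it keeps.

    Assumption: the leftmost bit stands for the first dimension.
    """
    n = len(dimensions)
    return tuple(d for i, d in enumerate(dimensions) if mask & (1 << (n - 1 - i)))


def lattice_levels(n):
    """Group all 2**n cuboid masks by the number of dimensions they keep."""
    levels = {k: [] for k in range(n, -1, -1)}
    for mask in range(2 ** n):
        levels[bin(mask).count("1")].append(format(mask, f"0{n}b"))
    return levels


def children(mask, n):
    """Cuboids reached by dropping exactly one dimension from `mask`."""
    return [mask & ~(1 << bit) for bit in range(n) if mask & (1 << bit)]


levels = lattice_levels(4)
for kept in sorted(levels, reverse=True):
    print(" ".join(levels[kept]))
```

Each printed row is one level of the lattice, from the base cuboid 1111 down to the apex cuboid 0000.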

Page 15: Apache Kylin - OLAP Cubes for SQL on Hadoop

From Relational to Key-Value

Page 16: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin Architecture Overview

(Architecture diagram; recoverable labels follow.)

• Clients: 3rd-party apps (web app, mobile…) and SQL tools (BI tools: Tableau…)

• Interfaces: REST API, JDBC/ODBC, SQL

• Kylin components: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)

• Storage: Hadoop and Hive (star schema data); OLAP cube in HBase (key-value data)

• Online analysis data flow: low latency (seconds) and mid latency (minutes)

• Offline data flow: star schema data to data cube

Clients/users interact with Kylin via SQL

OLAP cube is transparent to users

Page 17: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin Depends on Hadoop Eco-system

• Hive

– Input source, pre-join star schema during cube building

• MapReduce

– Aggregate metrics during cube building

• HDFS

– Store intermediate files during cube building

• HBase

– Store and query data cubes

• Calcite

– SQL parsing, code generation, optimization

Page 18: Apache Kylin - OLAP Cubes for SQL on Hadoop

Agenda

• What is Apache Kylin?

• Features & Tech Highlights

• Performance

• Roadmap

• Q & A

Page 19: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin Highlights

• Extremely Fast OLAP Engine at Scale

Kylin is designed to reduce query latency on Hadoop to seconds for 10+ billion rows of data

• ANSI SQL Interface on Hadoop

Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions

• Seamless Integration with BI Tools

Kylin currently offers integration capability with BI Tools like Tableau.

• Interactive Query Capability

Users can interact with Hadoop data via Kylin at sub-second latency

• MOLAP Cube

Users can define a data model and pre-build cubes in Kylin from more than 10 billion raw data records

Page 20: Apache Kylin - OLAP Cubes for SQL on Hadoop

More Highlights

• Compression and Encoding Support

• Incremental Refresh of Cubes

• Approximate Query Capability for Distinct Count (HyperLogLog)

• Leverage HBase Coprocessor to reduce query latency

• Job Management and Monitoring

• Easy Web interface to manage, build, monitor and query cubes

• Security capability to set ACL at Cube/Project Level

• Support LDAP Integration
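
The HyperLogLog bullet above is worth a small illustration. Distinct counts cannot be added across pre-aggregated cells, but HyperLogLog sketches can be merged, which is presumably why the technique suits cubes. The class below is a textbook HyperLogLog written for this transcript, not Kylin's implementation. The class name, the SHA-1 hash, and the precision of 10 are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog (illustrative only; precision must be >= 7)."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision              # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit within the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:       # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Hypothetical usage: count distinct seller ids, whole and in two merged parts.
full = HyperLogLog()
part_a, part_b = HyperLogLog(), HyperLogLog()
for seller_id in range(50_000):
    full.add(seller_id)
    (part_a if seller_id % 2 else part_b).add(seller_id)
part_a.merge(part_b)
print(round(full.count()), round(part_a.count()))
```

Merging the two partial sketches gives exactly the same registers as sketching all values at once, so the estimate survives roll-up. The price is a small relative error, roughly 1.04 / sqrt(m) for m registers.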

Page 21: Apache Kylin - OLAP Cubes for SQL on Hadoop

Cube Designer

Page 22: Apache Kylin - OLAP Cubes for SQL on Hadoop

Job Management

Page 23: Apache Kylin - OLAP Cubes for SQL on Hadoop

Query and Visualization

Page 24: Apache Kylin - OLAP Cubes for SQL on Hadoop

Tableau Integration

Page 25: Apache Kylin - OLAP Cubes for SQL on Hadoop

Data Modeling Points of View

(Diagram; recoverable labels follow.)

• Cube metadata: Cube, Fact Table, Dimensions, Measures, Storage (HBase)

• Source: star schema (a fact table with dimension tables)

• Target: HBase storage (row key; column family with values)

• Mapping: from source to target, described by the cube metadata

• Points of view: End User, Cube Modeler, Admin

Page 26: Apache Kylin - OLAP Cubes for SQL on Hadoop

Process Flow

(Diagram labels: Source (Hive) → Joined tables → Build dict → Dimension dictionaries)

Page 27: Apache Kylin - OLAP Cubes for SQL on Hadoop

Process Flow

(Diagram labels: Joined tables and dimension dictionaries → MR → n cuboid → MR → n-1 cuboids → MR → Apex cuboid)
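
The flow on this slide, from the n-dimension cuboid through the (n-1)-dimension cuboids down to the apex cuboid, can be sketched in a few lines. This is a single-process illustration of the layering idea only. The real build runs as MapReduce jobs. The toy data and the function names `drop_dimension` and `build_layers` are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location", "supplier")

# Hypothetical base (n-dimension) cuboid: full key -> sum(revenue).
BASE_CUBOID = {
    ("9/15", "milk", "Urbana", "Dairy_land"): 10.0,
    ("9/15", "milk", "Chicago", "Dairy_land"): 7.0,
    ("9/16", "milk", "Urbana", "Farm_co"): 5.0,
    ("9/16", "eggs", "Urbana", "Farm_co"): 3.0,
}


def drop_dimension(cuboid, position):
    """Aggregate a cuboid into a child cuboid by removing one key position."""
    child = defaultdict(float)
    for key, total in cuboid.items():
        child[key[:position] + key[position + 1:]] += total
    return dict(child)


def build_layers(base, dimensions):
    """Return one dict per layer, {kept dimensions: cuboid}, from n dims to 0."""
    layers = [{tuple(dimensions): base}]
    while len(next(iter(layers[-1]))) > 0:
        next_layer = {}
        for dims, cuboid in layers[-1].items():
            for position in range(len(dims)):
                child_dims = dims[:position] + dims[position + 1:]
                if child_dims not in next_layer:     # build each cuboid once
                    next_layer[child_dims] = drop_dimension(cuboid, position)
        layers.append(next_layer)
    return layers


layers = build_layers(BASE_CUBOID, DIMENSIONS)
print([len(layer) for layer in layers], layers[-1][()])
```

Each layer is computed from the one above it rather than from the raw rows. That works here because a sum can be re-aggregated from partial sums.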

Page 28: Apache Kylin - OLAP Cubes for SQL on Hadoop

Process Flow

(Diagram labels: n cuboid, n-1 cuboids, Apex cuboid → MR → HFiles → HBase)

Page 29: Apache Kylin - OLAP Cubes for SQL on Hadoop

How To Store Cube? – HBase Schema

Page 30: Apache Kylin - OLAP Cubes for SQL on Hadoop

Query Engine – Kylin Explain Plan

SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
       test_kylin_fact.lstg_format_name, test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
                       AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
         test_kylin_fact.lstg_format_name, test_sites.site_name

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

Page 31: Apache Kylin - OLAP Cubes for SQL on Hadoop

Now Let’s Make it Really Work

• Full Cube

– Pre-aggregate all dimension combinations

– “Curse of dimensionality”: an N-dimension cube has $2^N$ cuboids

• Partial Cube

– To avoid dimension explosion, we divide the dimensions into different aggregation groups

• $2^{N+M+L} \rightarrow 2^N + 2^M + 2^L$

– For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the cuboid count is reduced from 1 billion to 3 thousand

• $2^{30} \rightarrow 2^{10} + 2^{10} + 2^{10}$

– Tradeoff between online aggregation and offline pre-aggregation
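
The arithmetic behind the 30-dimension example is easy to check. The two functions below follow the simple counting model on this slide. Their names are hypothetical, and a real cube definition may count cuboids differently.

```python
def full_cube_cuboids(num_dimensions):
    """Every subset of the dimensions is a cuboid."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """Cuboids when dimensions are split into independent aggregation groups."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(f"{full:,} -> {partial:,}")
```

The full cube needs 1,073,741,824 cuboids and the three-group partial cube needs 3,072. That is the "1 billion to 3 thousand" reduction quoted above.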

Page 32: Apache Kylin - OLAP Cubes for SQL on Hadoop

Incremental Cube Building

Page 33: Apache Kylin - OLAP Cubes for SQL on Hadoop

Agenda

• What is Apache Kylin?

• Features & Tech Highlights

• Performance

• Roadmap

• Q & A

Page 34: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin vs. Hive

Based on 12+B records

#  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Comments
1  High Level Aggregation               4               0.129            157.437  1,217 times
2  Analysis Query                  22,669               1.615            109.206  68 times
3  Drill Down to Detail           325,029              12.058            113.123  9 times
4  Drill Down to Detail           524,780              22.42            6383.21   278 times
5  Data Dump                      972,002              49.054                N/A

(Bar chart: Hive vs. Kylin for SQL #1, SQL #2, and SQL #3; axis ticks from 0 to 200.)

(Pyramid diagram labels: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level.)

Page 35: Apache Kylin - OLAP Cubes for SQL on Hadoop

Performance Scaleout

Linear scale out with more nodes

Page 36: Apache Kylin - OLAP Cubes for SQL on Hadoop

Performance - Query Latency

99 %-ile

95 %-ile

Page 37: Apache Kylin - OLAP Cubes for SQL on Hadoop

Agenda

• What is Apache Kylin?

• Features & Tech Highlights

• Performance

• Roadmap

• Q & A

Page 38: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin History and Roadmap

(Timeline diagram spanning 2013, 2014, and 2015; milestone dates shown: Sep 2013, Jan 2014, Sep 2014, Q1 2015.)

• Initial: prototype for MOLAP

– Basic end-to-end POC

• MOLAP

– Incremental Refresh

– ANSI SQL

– ODBC Driver

– Web GUI

– ACL

– Open Source

• HOLAP

– Streaming OLAP

– JDBC Driver

– New UI

– Excel Support

– … more

• Next Gen

– Automation

– Capacity Management

– In-Memory Analysis (TBD)

– Spark (TBD)

– … more

• Future…

– TBD

Page 39: Apache Kylin - OLAP Cubes for SQL on Hadoop

Kylin Ecosystem

• Kylin Core

– Fundamental framework of the Kylin OLAP engine

• Extension

– Plugins to support additional functions and features

• Integration

– Lifecycle management support to integrate with other applications

• Interface

– Allows third-party users to build more features via a user interface atop the Kylin core

• Driver

– ODBC and JDBC drivers

(Diagram labels: Kylin OLAP Core at the center. Extension: Security, Redis Storage, Spark Engine, Docker. Interface: Web Console, Customized BI, Ambari/Hue Plugin. Integration: ODBC Driver, ETL, Drill, SparkSQL.)

Page 40: Apache Kylin - OLAP Cubes for SQL on Hadoop

If you want to go fast, go alone.

If you want to go far, go together.

--African Proverb

Page 41: Apache Kylin - OLAP Cubes for SQL on Hadoop

Agenda

• What is Apache Kylin?

• Features & Tech Highlights

• Performance

• Roadmap

• Q & A

Page 42: Apache Kylin - OLAP Cubes for SQL on Hadoop

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies