Apache Kylin Extreme OLAP Engine for Big Data

34
Apache Kylin Extreme OLAP Engine for Big Data Luke Han, Yang Li 2015-05-06 http://kylin.io | @ApacheKylin

Transcript of Apache Kylin Extreme OLAP Engine for Big Data

Page 1: Apache Kylin Extreme OLAP Engine for Big Data

Apache KylinExtreme OLAP Engine

for Big Data

Luke Han, Yang Li

2015-05-06

http://kylin.io | @ApacheKylin

Page 2: Apache Kylin Extreme OLAP Engine for Big Data

About us Luke Han | [email protected] | @lukehq

Apache Kylin PMC member & Product Owner

Sr. Product Manager of eBay GDI

from Shanghai China

Yang Li | [email protected]

Apache Kylin PMC member & Tech Leader

Sr. Architect of eBay GDI

from Shanghai China

Page 3: Apache Kylin Extreme OLAP Engine for Big Data

Agenda About Apache Kylin Feature Highlights Tech Highlights Roadmap Q&A

Page 4: Apache Kylin Extreme OLAP Engine for Big Data

What

kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big DataKylin is an open source Distributed Analytics Engine, contributed

by eBay Inc., provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

• Open Sourced on Oct 1st, 2014• Be accepted as Apache Incubator Project on Nov 25th, 2014

• http://kylin.io (http://kylin.incubator.apache.org)

@ApacheKylin

Page 5: Apache Kylin Extreme OLAP Engine for Big Data

Why

Happiness

Latency10s

size

Page 6: Apache Kylin Extreme OLAP Engine for Big Data

Balance Between Space and Time

time, item

time, item, location

time, item, location, supplier

time item location supplier

time, location

Time, supplier

item, location

item, supplier

location, supplier

time, item, supplier

time, location, supplier

item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>

• Cuboid = one combination of dimensions

• Cube = all combination of dimensions (all cuboids)

OLAP Cube

Page 7: Apache Kylin Extreme OLAP Engine for Big Data

How

Map Reduce

Kylin

BI Tools, Web App…

ANSI SQL

Page 8: Apache Kylin Extreme OLAP Engine for Big Data

Agenda About Apache Kylin Feature Highlights Tech Highlights Roadmap Q&A

Page 9: Apache Kylin Extreme OLAP Engine for Big Data

Feature Highlights• Extremely Fast OLAP Engine at scale

• ANSI SQL Interface on Hadoop

• Seamless Integration with BI Tools, like Tableau

• Interactive Query Capability

• MOLAP Cube

• Incremental Build of Cubes

• Approximate Query Capability for Distinct Count (HyperLogLog)

• Leverage HBase Coprocessor for query latency

• Job Management and Monitoring

• User friendly Web GUI for manage, build, monitor and query cubes

• Security capability to set ACL at Cube/Project Level

• Support LDAP Integration

Page 10: Apache Kylin Extreme OLAP Engine for Big Data

Define Data Model

Page 11: Apache Kylin Extreme OLAP Engine for Big Data

Manage Jobs

Page 12: Apache Kylin Extreme OLAP Engine for Big Data

Explore the Data

Page 13: Apache Kylin Extreme OLAP Engine for Big Data

Interactive with BI Tool - Tableau

Page 14: Apache Kylin Extreme OLAP Engine for Big Data

eBay- 90% query < 5 seconds

Baidu- Baidu Map internal analysis

Many other Proof of Concepts- Huawei, Bloomberg Law, British GAS, JD.com, Microsoft, StubHub, Tableau …

Case Cube Size Raw Records

User Session Analysis 26 TB 28+ billion rows

Traffic Analysis 21 TB 20+ billion rows

Behavior Analysis 560 GB 1.2+ billion rows

—from mailing list

Who are using Kylin?

Page 15: Apache Kylin Extreme OLAP Engine for Big Data

Agenda About Apache Kylin Feature Highlights Tech Highlights Roadmap Q&A

Page 16: Apache Kylin Extreme OLAP Engine for Big Data

Cube Build Engine(MapReduce…)

SQL

Low Latency - SecondsMid Latency - MinutesRouting

3rd Party App(Web App, Mobile…)

Metadata

SQL-Based Tool(BI Tools: Tableau…)

Query Engine

HadoopHive

REST API JDBC/ODBC

Online Analysis Data FlowOffline Data Flow

Clients/Users interactive with Kylin via SQL

OLAP Cube is transparent to users

Star Schema Data Key Value Data

Data Cube

OLAPCube(HBase)

SQL

REST Server

Kylin Architecture Overview

Page 17: Apache Kylin Extreme OLAP Engine for Big Data

Cube: …Fact Table: …Dimensions: …Measures: …Storage(HBase): …

Fact

Dim Dim

Dim

SourceStar Schema

row A

row B

row C

Column Family

Val 1

Val 2

Val 3

Row Key Column

Target HBase Storage

MappingCube Metadata

End User Cube Modeler Admin

Data Modeling

Page 18: Apache Kylin Extreme OLAP Engine for Big Data

Cube Build Job Flow

Page 19: Apache Kylin Extreme OLAP Engine for Big Data

How to Store Cube - HBase Schema

Page 20: Apache Kylin Extreme OLAP Engine for Big Data

SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNTFROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_idWHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New'GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name

OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

Kylin Query Engine - Explain Plan

Page 21: Apache Kylin Extreme OLAP Engine for Big Data

Cube Optimization Curse of Dimensionality

N dimension cube has 2N cuboid Full Cube vs. Partial Cube

Huge Data Volume Dictionary Encoding Incremental Building

Page 22: Apache Kylin Extreme OLAP Engine for Big Data

Full Cube- Pre-aggregate all dimension combinations

- “Curse of dimensionality”: N dimension cube has 2N cuboid.

Partial Cube- To avoid dimension explosion, we divide the dimensions into different aggregation

groups

- 2N+M+L 2N + 2M + 2L

- For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid

number will reduce from 1 Billion to 3 Thousands

- 230 210 + 210 + 210

- Tradeoff between online aggregation and offline pre-aggregation

Full Cube vs. Partical Cube

Page 23: Apache Kylin Extreme OLAP Engine for Big Data

Partical Cube

Page 24: Apache Kylin Extreme OLAP Engine for Big Data

Incremental Build

Page 25: Apache Kylin Extreme OLAP Engine for Big Data

What’s Next Improve cube algorithm Cube by segments, 30%-50% faster Build delay down to tens of minutes

Streaming cubing Analyze real-time data Build delay down to seconds

Spark

Page 26: Apache Kylin Extreme OLAP Engine for Big Data

Cube by Layer

The current algorithm- Many MRs, the number of

dimensions

- Huge shuffles, aggregation at

reduce side, 100x of total cube

size

Full Data

0-D Cuboid

1-D Cuboid

2-D Cuboid

3-D Cuboid

4-D Cuboid

MR

MR

MR

MR

MR

Page 27: Apache Kylin Extreme OLAP Engine for Big Data

Cube by Segments

The to-be algorithm,

30%-50% faster- 1 round MR

- Reduced shuffles, map side

aggregation, 20x total cube

size

- Hourly incremental build done

in tens of minutes

Data Split

Cube Segment

Data Split

Cube Segment

Data Split

Cube Segment

……

Final Cube

Merge Sort(Shuffle)

mapper mapper mapper

Page 28: Apache Kylin Extreme OLAP Engine for Big Data

Streaming Cubing

Cube is great but…- Cube takes time to build, how about real-time analysis?

- Sometimes we want to drill down to row level information

Streaming cubing- Build micro cube segments from streaming

- Use inverted index to capture last minute data

Page 29: Apache Kylin Extreme OLAP Engine for Big Data

Streamingseconds d

elayLast Hour

Inverted Index

Before Last Hour

Cube

Kylin Lambda Architecture

minutes delay

Que ry E ngine

ANSI SQ

L

Hybrid StorageInterface

Page 30: Apache Kylin Extreme OLAP Engine for Big Data

Adding Spark Support Cubing Efficiency MR is not optimal framework Spark Cubing Engine

Source from SparkSQL Read data from SparkSQL instead of Hive

Route to SparkSQL Unsupported queries be coved by SparkSQL

Page 31: Apache Kylin Extreme OLAP Engine for Big Data

Agenda About Apache Kylin Feature Highlights Tech Highlights Roadmap Q&A

Page 32: Apache Kylin Extreme OLAP Engine for Big Data

Future2015 2016

Kylin Evolution Roadmap20142013

Initial

Prototype for MOLAP• Basic end to end POC

MOLAP• Incremental Refresh• ANSI SQL• ODBC Driver• Web GUI• Tableau• ACL• Open Source

StreamingOLAP• Streaming OLAP• JDBC Driver• New UI• Excel• SparkSQL• … more

TBD

Sep, 2013

Jan, 2014

Oct, 2014

H1, 2015

HybridOLAP• Lambda Arch• Automation• Capacity

Management• Spark • … more

Next Gen• Adv OLAP Functions• In-Memory Analysis

(TBD)• Mobile (TBD)• … more

Page 33: Apache Kylin Extreme OLAP Engine for Big Data

Kylin Ecosystem

Kylin Core Fundamental framework of Kylin OLAP Engine

Extension Plugins to support for additional functions and features

Integration Lifecycle Management Support to integrate with other

applications

Interface Allows for third party users to build more features via user-

interface atop Kylin core

Kylin OLAPCore

Extension Security Redis Storage Spark Engine Docker

Interface Web Console Customized BI Ambari/Hue Plugin

Integration ODBC Driver ETL Drill SparkSQL

Page 34: Apache Kylin Extreme OLAP Engine for Big Data

http://kylin.io

If you want to go fast, go alone.If you want to go far, go together.

--African [email protected]