Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis

Rohit Choudhary & Jeff Zhang
June 28, 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 2: What’s Apache Zeppelin?

A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Page 3: Interactive Analysis 1.0 (Spark-shell)

Page 4: Interactive Analysis 2.0 (Zeppelin)

Spark Interpreter

Page 5: Interactive Analysis 3.0 (Zeppelin + Livy)

Livy Interpreter

Page 6: Open Source Activity

Page 7: Quick Stats: Zeppelin

• Zeppelin graduated in May 2016 and is now a TLP (top-level project)
• Incubated by the Apache Software Foundation since Dec 2014
• 9 committers, 120+ contributors, and a growing list
• 1000+ JIRAs filed
• 900 PRs via the community
• Zeppelin just got a new friend: “R”

Page 8: Recent Updates

• Multi-tenancy with Livy
• Generic JDBC Interpreter
  – Hive, Phoenix, Redshift
  – Postgres, MySQL
  – Several others
• Notebook authentication and authorization
• UI automation through Selenium
• Security for other interpreters (on its way)

Page 9: Usage Patterns & Feedback

• Cluster monitoring, memory analysis
• Telecom data usage
• Concert attendees' travel patterns

Page 10: Upcoming GA

GA with HDP 2.5 & Ambari 2.4.0. ETA: end of July.

Page 11: Architecture & Usage

Page 12: Zeppelin Architecture

Current interpreter support:

• HDFS
• PySpark, SparkR, Spark
• Hive, Phoenix, SQL
• Shell
• …

Page 13: Zeppelin Features

• Collate/Load Data: collate or load data from existing data sources, or load from external CSVs, e.g. Eureka, SmartSense
• Visualize: a robust visualization mechanism to visualize data and enable insights
• Collaborate: notebook-based collaboration and notebook export; soon to be added: tagging of notebook-generated data

Page 14: Popular Usage Scenarios

• Customized Dashboards: intended for building customized dashboards for Big Data clusters
• Security Analytics: understanding the nature of data coming from multiple sources and analyzing its effects
• Bio-sciences: medical research companies are interested in using this for their research

Page 15: Bringing Multi-tenancy to Zeppelin

Page 16: Multi-Tenancy: Motivation

Objectives

• Supporting workloads of multiple customers
• Supporting multiple LOBs (lines of business) on a single data system
• Supporting fine-grained audits

Requirements

• Inability to provision capacity for multiple user groups
• Inability to audit user actions, as all jobs are run via the ‘zeppelin’ proxy user
• Inability to share state/data with other users

Page 17: Zeppelin Livy Interaction

[Diagram: Security Across Zeppelin-Livy-Spark. Components: LDAP, Shiro, Zeppelin, Ispark Group Interpreter, Livy, Spark, YARN. Connection labels: SPNEGO (Kerberos), Kerberos, Livy APIs.]

Page 18: Deep dive on Livy

Page 19: What is Livy

Livy is an open source REST interface for interacting with Spark from anywhere.

[Diagram: Livy Client talks to Livy Server over HTTP. Livy Server talks over HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext.]

Page 20: Why we need Livy with Zeppelin

• Reduce the pressure on the client machine
• Make job submission/monitoring easy
• Customize the job schedule

Page 21: Interactive Session – Create Session

Request:

curl -X POST --data '{"kind": "spark"}' -H "Content-Type: application/json" localhost:8998/sessions

Response:

{"state":"starting","proxyUser":"null","id":1,"kind":"spark","log":[]}

[Diagram: Livy Client, Livy Server and Spark Interactive Session (SparkContext), with the exchange numbered in steps 1 to 4.]
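The create-session call above can also be made from a program. The following is a minimal sketch using only the Python standard library, assuming a Livy server at localhost:8998 as in the curl example. The function names are hypothetical, and the sample response is the one shown on the slide.

```python
import json
import urllib.request

# Assumption: Livy listens on localhost:8998, as in the curl example.
LIVY_URL = "http://localhost:8998"


def build_create_session_request(kind="spark", base_url=LIVY_URL):
    """Build (but do not send) the POST /sessions request."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_session(response_text):
    """Pull the session id and state out of the JSON response."""
    session = json.loads(response_text)
    return session["id"], session["state"]


def create_session(kind="spark"):
    """Send the request; this needs a running Livy server."""
    with urllib.request.urlopen(build_create_session_request(kind)) as resp:
        return parse_session(resp.read().decode("utf-8"))


# The sample response from the slide.
SAMPLE_RESPONSE = (
    '{"state":"starting","proxyUser":"null","id":1,"kind":"spark","log":[]}'
)
```

The session id returned here is what later requests put into the `/sessions/<id>/...` path. Because the state starts as "starting", a client would normally wait before sending code.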

Page 22: Interactive Session – Execute Code

Request:

curl http://localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sc.parallelize(0 to 100).sum()"}'

Response:

{"id":0,"state":"running","output":null}

[Diagram: Livy Client, Livy Server and Spark Interactive Session (SparkContext), with the exchange numbered in steps 1 to 4.]
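Since the response arrives while the statement is still running and its output is null, a client has to ask again later for the result. The sketch below builds the submit request and the URL to poll. The polling path `/sessions/<id>/statements/<statement id>` reflects the Livy REST API as I understand it and is not shown on the slide, so treat it as an assumption. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: same host and port as the curl example.
LIVY_URL = "http://localhost:8998"


def build_statement_request(session_id, code, base_url=LIVY_URL):
    """Build (but do not send) POST /sessions/{id}/statements."""
    return urllib.request.Request(
        url=f"{base_url}/sessions/{session_id}/statements",
        data=json.dumps({"code": code}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_url(session_id, statement_id, base_url=LIVY_URL):
    """URL a client would GET to check on a submitted statement."""
    return f"{base_url}/sessions/{session_id}/statements/{statement_id}"


def is_pending(statement):
    """True while the statement has no output yet (output is null)."""
    return statement.get("output") is None
```

A polling loop would GET `statement_url(...)` at intervals until `is_pending` returns False, then read the `output` field.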

Page 23: SparkContext Sharing

[Diagram: Client 1, Client 2 and Client 3 connect through the Livy Server to Session-1 and Session-2. These map to SparkSession-1 and SparkSession-2, each with its own SparkContext.]

Page 24: Livy Security

[Diagram: Client connects via SPNEGO to the Livy Server (Impersonation), which connects via a Shared Secret to the SparkSession.]

• Only authorized users can launch a Spark session or submit code
• Each user can access only their own session
• Only the Livy server can submit jobs securely to the Spark session

Page 25: SPNEGO

Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO), often pronounced "spen-go".

It is a GSSAPI "pseudo mechanism" used by client-server software to negotiate the choice of security technology.

Message flow between the Client (Kerberos TGT) and the Livy Server (SPNEGO enabled):

• HTTP GET http://site/a.html
• Error 401 Unauthorized
• HTTP GET request with Authorization: Negotiate
• HTTP GET request
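The exchange above (plain GET, 401 challenge, then a GET carrying a Negotiate token) can be sketched as a small client routine. This is an illustration only. `send` and `get_token` are hypothetical callables supplied by the caller, `fake_server` is a stand-in for a SPNEGO-enabled server, and obtaining a real GSSAPI token from the user's Kerberos TGT is out of scope here.

```python
def needs_negotiate(status, headers):
    """True if the server answered 401 and offers the Negotiate scheme."""
    challenge = headers.get("WWW-Authenticate", "")
    return status == 401 and challenge.split(" ")[0] == "Negotiate"


def fetch_with_spnego(send, get_token, url):
    """Run the challenge-and-retry exchange.

    send(url, headers) -> (status, headers, body) performs one HTTP GET.
    get_token() -> str returns a base64 token (hypothetical helper; a
    Kerberos library would derive it from the user's TGT).
    """
    status, headers, body = send(url, {})
    if needs_negotiate(status, headers):
        auth = {"Authorization": "Negotiate " + get_token()}
        status, headers, body = send(url, auth)
    return status, body


def fake_server(url, headers):
    """Stand-in for a SPNEGO-enabled server, for illustration only."""
    if headers.get("Authorization", "").startswith("Negotiate "):
        return 200, {}, "ok"
    return 401, {"WWW-Authenticate": "Negotiate"}, ""
```

Calling `fetch_with_spnego(fake_server, lambda: "dG9rZW4=", "http://site/a.html")` walks through both round trips against the stand-in server.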

Page 26: Impersonation

[Diagram: Alice (Kerberos TGT) and Bob (Kerberos TGT) each authenticate via SPNEGO to the Livy Server (super user: livy), which connects via a Shared Secret to a separate Spark Session for each of them.]
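One way to picture impersonation at the API level is the `proxyUser` field, which also appears in the create-session response earlier in the deck. The sketch below builds one session request body per user, assuming the Livy server accepts `proxyUser` in the POST /sessions body and is permitted to impersonate those users. The user names and helper names are illustrative.

```python
import json


def session_payload(user, kind="spark"):
    """Request body asking Livy to run the session as `user`."""
    return {"kind": kind, "proxyUser": user}


def encode(payload):
    """Serialize a payload the way it would be sent over HTTP."""
    return json.dumps(payload, sort_keys=True).encode("utf-8")


# One body per end user: the Livy server (running as super user "livy")
# would launch a separate Spark session on behalf of each.
alice_body = encode(session_payload("alice"))
bob_body = encode(session_payload("bob"))
```

Jobs started this way run under the end user's identity rather than a single shared service account, which is what makes per-user auditing possible.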

Page 27: Shared Secret

1. The Livy Server generates a secret key
2. The Livy Server passes the secret key to the Spark session when launching it
3. Both sides use the secret key to communicate with each other

[Diagram: Livy Server and Spark Session linked by the Shared Secret.]
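The three steps can be illustrated with a generic shared-key scheme. The slide does not say how Livy uses the key on the wire, so this sketch substitutes HMAC message authentication from the Python standard library purely as an example of two parties proving that they hold the same secret. It is not a description of Livy's actual protocol.

```python
import hashlib
import hmac
import secrets


def generate_secret():
    """Step 1: the server creates a random secret key."""
    return secrets.token_bytes(32)


def sign(secret, message):
    """Tag a message so the peer can check it came from a key holder."""
    return hmac.new(secret, message, hashlib.sha256).digest()


def verify(secret, message, tag):
    """Constant-time check of a received tag."""
    return hmac.compare_digest(sign(secret, message), tag)


# Step 2: the server hands the same key to the session at launch
# (modelled here as plain variable sharing).
server_key = generate_secret()
session_key = server_key

# Step 3: a message is accepted only when its tag matches.
msg = b"run statement 0"
tag = sign(server_key, msg)
```

A party without the key cannot produce a valid tag, which matches the slide's point that only the Livy server can submit work to the Spark session.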

Page 28: Multi Tenant: Zeppelin Demo

Page 29: Zeppelin Direction

• Workspaces and Collaboration
• Customizable Visualization
  – Helium
  – Custom, data-type-based visualization (Geolocation/Maps)
• Enterprise Readiness
  – Bring security to all interpreters
  – Performance improvements
• Collaboration
• Data Lineage

Page 30: Q & A

Page 31: Thank You