Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Page 1: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Interactive Flink analytics with HopsWorks and Zeppelin

Jim Dowling

Ermias Gebremeskel

www.hops.io @hopshadoop

Page 2: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Marketing 101: Celebrity Endorsements

Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product.

*Turing Award Winner 2014, Father of Distributed Systems

Page 3: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Talk Overview

•Multi-tenancy in Hadoop

•Multi-tenancy in HopsWorks

•Free-Text Search of Hadoop Metadata in HopsWorks

•Zeppelin and Flink in HopsWorks

Page 4: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Goal: Multi-Tenancy and Data Sharing

No Unauthorized Copying/Cross-Linking of Data

[Diagram. Labels: Project NSA, Project X, DataSet; edges: owns, authorize access]

Page 5: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Access Control in Relational Databases

# How do we provide multi-tenancy for users alice and bob using two databases, db1 and db2?

GRANT ALL PRIVILEGES ON db1.* TO 'alice'@'%';

GRANT ALL PRIVILEGES ON db2.* TO 'bob'@'%';

# More fine-grained privileges

GRANT SELECT ON db2.sensitiveTable TO 'alice'@'192.168.1.2';

What happens to the privileges if I call “drop table db2.sensitiveTable”?
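
One way to read the question: in MySQL, privileges granted on a table are not revoked automatically when the table is dropped. The sketch below is a toy Python model of that name-keyed design, not MySQL's implementation; the `Catalog` class and its methods are hypothetical names invented for this example.

```python
class Catalog:
    """Toy model: tables and grants are kept in separate, unlinked stores."""

    def __init__(self):
        self.tables = set()
        self.grants = {}  # (user, table_name) -> set of privileges

    def create_table(self, name):
        self.tables.add(name)

    def grant(self, user, table, privilege):
        self.grants.setdefault((user, table), set()).add(privilege)

    def drop_table(self, name):
        # Only the table is removed; nothing links it to its grants.
        self.tables.discard(name)

    def can(self, user, table, privilege):
        granted = self.grants.get((user, table), set())
        return table in self.tables and privilege in granted


catalog = Catalog()
catalog.create_table("db2.sensitiveTable")
catalog.grant("alice", "db2.sensitiveTable", "SELECT")
catalog.drop_table("db2.sensitiveTable")

# The grant is now dangling: it names a table that no longer exists.
dangling = [key for key in catalog.grants if key[1] not in catalog.tables]

# If a table with the same name is created later, alice can read it again.
catalog.create_table("db2.sensitiveTable")
regained = catalog.can("alice", "db2.sensitiveTable", "SELECT")
```

In this model the policy and the data can drift apart because nothing ties a grant to the lifetime of the object it protects.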

Page 6: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Access Control in Hadoop: Apache Sentry

How do you ensure the consistency of the policies and the data?

[Mujumdar’15]

Page 7: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Policy Editor for Sentry

Page 8: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Performance of Policy Enforcement Points (PEP)

*https://docs.wso2.com/display/IS500/XACML+Performance+in+the+Identity+Server

Page 9: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

PEPs + Hadoop = Horse-Drawn Sportscar

Policy Enforcement Engines ≈ O(2,000) ops/sec

HopsFS Distributed Filesystem ≈ O(100,000) ops/sec

Horse-Drawn Sportscar
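
The mismatch on this slide can be put as simple arithmetic. The sketch below uses the two order-of-magnitude figures from the slide and assumes, for illustration only, that every filesystem operation needs exactly one synchronous policy decision.

```python
PEP_OPS_PER_SEC = 2_000        # order of magnitude from the slide
HOPSFS_OPS_PER_SEC = 100_000   # order of magnitude from the slide

# Assumption: each filesystem operation triggers exactly one policy decision,
# so the slower component caps end-to-end throughput.
effective_ops_per_sec = min(PEP_OPS_PER_SEC, HOPSFS_OPS_PER_SEC)
slowdown = HOPSFS_OPS_PER_SEC / effective_ops_per_sec
unused_capacity = 1 - effective_ops_per_sec / HOPSFS_OPS_PER_SEC

print(f"effective throughput: {effective_ops_per_sec} ops/sec")
print(f"slowdown: {slowdown:.0f}x, unused filesystem capacity: {unused_capacity:.0%}")
```

Under that assumption the policy engine, not the filesystem, sets the ceiling on throughput.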

Page 10: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HopsWorks

Page 11: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Users, DataSets, and Projects

In-Place Data Sharing - not Copying!

DataSet1 DataSet2 DataSet3

Project 1 Project 2 Project 3

Page 12: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

User

•Authentication Provider

- JDBC Realm

- 2-Factor Authentication

- LDAP

Page 13: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Project

•Members

- Roles: Owner, Data Scientist

•DataSets

- Home project

- Can be shared

Page 14: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Project Roles

•Owner Privileges

- Import/Export data

- Manage Membership

- Share DataSets

•Data Scientist Privileges

- Write code

- Run code

- Request access to DataSets

We delegate administration of privileges to users
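
The role split on this slide could be sketched as a small lookup table. The privilege identifiers below are hypothetical names derived from the slide's bullets, not the HopsWorks API, and the mapping lists only what the slide lists for each role.

```python
from enum import Enum


class Role(Enum):
    OWNER = "Owner"
    DATA_SCIENTIST = "Data Scientist"


# Privileges exactly as listed on the slide; identifiers are illustrative.
PRIVILEGES = {
    Role.OWNER: {"import_export_data", "manage_membership", "share_datasets"},
    Role.DATA_SCIENTIST: {"write_code", "run_code", "request_dataset_access"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given project role carries the given privilege."""
    return action in PRIVILEGES[role]


print(is_allowed(Role.OWNER, "share_datasets"))           # True
print(is_allowed(Role.DATA_SCIENTIST, "share_datasets"))  # False
```

Because the table is per project, a project Owner can manage membership and sharing without involving a cluster administrator.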

Page 15: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Sharing DataSets between Projects

The same as Sharing Folders in Dropbox

Page 16: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Delegate Access Control to HDFS

•HDFS enforces access control

•Convention for directories

•Hadoop and HopsWorks use the same Users and Groups in a common DB

•UserId per Project

•GroupId per Project and DataSet
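
The last two bullets can be illustrated with a naming sketch. The double-underscore separator and the function names are assumptions made for this example; the slide only says that there is one user identity per project and one group per project and per DataSet.

```python
SEPARATOR = "__"  # assumed separator, not taken from the slide


def project_user(project: str, username: str) -> str:
    """HDFS user name for one person acting inside one project."""
    return f"{project}{SEPARATOR}{username}"


def project_group(project: str) -> str:
    """HDFS group for the members of a project."""
    return project


def dataset_group(project: str, dataset: str) -> str:
    """HDFS group for one DataSet that lives in the given home project."""
    return f"{project}{SEPARATOR}{dataset}"


# The same person gets a different HDFS identity in each project, so plain
# HDFS permission checks are enough to keep the two projects apart.
alice_in_nsa = project_user("nsa", "alice")
alice_in_x = project_user("x", "alice")
logs_group = dataset_group("nsa", "logs")
```

With identities scoped this way, ordinary HDFS user and group checks do the enforcement and no separate policy engine sits on the request path.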

With Hadoop metadata in a DB, we guarantee policy integrity with Foreign Keys
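
A minimal sketch of the foreign-key idea, using SQLite from the Python standard library as a stand-in for the NDB database named in the talk. The table and column names (`inodes`, `dataset_shares`) are hypothetical and do not reflect the real Hops schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE inodes (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL UNIQUE
    );
    CREATE TABLE dataset_shares (
        inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,
        project  TEXT NOT NULL,
        PRIMARY KEY (inode_id, project)
    );
""")

conn.execute("INSERT INTO inodes (id, path) VALUES (1, '/Projects/nsa/DataSet1')")
conn.execute("INSERT INTO dataset_shares (inode_id, project) VALUES (1, 'x')")

# A policy row that points at a non-existent inode is rejected outright.
try:
    conn.execute("INSERT INTO dataset_shares (inode_id, project) VALUES (99, 'x')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the directory removes its share policies along with it.
conn.execute("DELETE FROM inodes WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM dataset_shares").fetchone()[0]
```

Here the database itself rules out both failure modes: a policy cannot reference missing data, and data cannot be deleted while leaving its policy behind.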

Page 17: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Engine – HopsFS, HopsYARN

Page 18: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HopsFS

[Architecture diagram. Labels: Stateless NameNodes, Leader, NDB, DataNodes, HopsWorks (J2EE Server), Metadata & policies]

Page 19: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HopsYARN

[Architecture diagram. Labels: ResourceMgrs, Scheduler, Resource Trackers, NDB, NodeManagers, HopsWorks (J2EE Server), Metadata & policies]

Page 20: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Data Abstraction Layer (DAL)

•NameNode (Apache v2)

•ResourceMgr (Apache v2)

•DAL API (Apache v2)

•NDB-DAL-Impl (GPL v2)

•Other Impl (Other License)

Jars: hops-2.4.0.jar, dal-ndb-2.4.0-7.4.7.jar

Page 21: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Hops Performance

Page 22: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HopsFS Metadata Scaleout

Assuming 256 MB Block Size, 100 GB JVM Heap for Apache Hadoop

Page 23: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HopsFS Throughput (Real Workload)

Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances

Page 24: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

What else can we do with metadata in a DB?

Page 25: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

How ACME Inc. handles Free-Text Search

HDFS

In Theory

Unified Search and Update API

In Practice

Inconsistent Metadata

Page 26: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Global Search: Projects and DataSets

Page 27: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Project Search: Files, Directories

Page 28: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Design your own Extended Metadata

Page 29: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

MetaData Entry

Page 30: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Free Text Search with Consistent Metadata

Free-Text Search

Distributed Database

ElasticSearch

The Distributed Database is the Single Source of Truth.

Foreign keys ensure the integrity of Metadata.

MetaDataDesigner

MetaDataEntry
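
The "single source of truth" idea can be sketched as a search index that is only ever derived from the database. This is a toy model under stated assumptions: SQLite stands in for the distributed database, an in-memory inverted index stands in for ElasticSearch, and the `metadata` table and both functions are hypothetical. The slide does not describe how the index is actually kept in sync.

```python
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE metadata (inode_id INTEGER PRIMARY KEY, description TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [(1, "genome samples 2015"), (2, "sensor logs"), (3, "genome annotations")],
)


def rebuild_index(conn):
    """Derive the whole search index from the database, never the reverse."""
    index = defaultdict(set)
    rows = conn.execute("SELECT inode_id, description FROM metadata")
    for inode_id, description in rows:
        for token in description.lower().split():
            index[token].add(inode_id)
    return index


def search(index, term):
    """Return the sorted inode ids whose description contains the term."""
    return sorted(index.get(term.lower(), set()))


index = rebuild_index(db)
before = search(index, "genome")   # [1, 3]

db.execute("DELETE FROM metadata WHERE inode_id = 1")
index = rebuild_index(db)          # the index follows the database
after = search(index, "genome")    # [3]
```

Since writes go only to the database, the index can always be thrown away and rebuilt, which is what keeps search results from contradicting the stored metadata.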

Page 31: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Flink and Zeppelin in HopsWorks

Page 32: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Batch Job Analytics

Page 33: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Interactive Analytics: Flink on Zeppelin

Page 34: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Other Features

•Audit Logs

•Erasure Coding Replication

•Online upgrade of Hops (and NDB)

•Automated Installation with Karamel

•Tinker friendly – easy to extend metadata!

Page 35: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Conclusions

•Hops is a next-generation distribution of Hadoop.

•HopsWorks is a frontend to Hops that supports true multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs.

•Looking for contributors/committers

- Pick-me-up on GitHub

www.hops.io

Page 36: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

The Team

Academics: Jim Dowling, Seif Haridi

PostDocs: Gautier Berthou

PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh

MSc Students: K. Srijeyanthan “Sri”, Evangelos Savvidis, Seçkin Savaşçı, Ermias Gebremeskel

Alumni: Steffen Grohsschmiedt, Theofilos Kakantousis, Stig Viaene, Andre Moré, Qi Qi, Alberto Lorente, Hooman Peiro, Jude D’Souza, Nikolaos Stanogias, Daniel Bali, Ioannis Kirkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Page 37: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Hops [Hadoop For Humans]

www.hops.io @hopshadoop

Page 38: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

HDFS v2 Architecture

[Architecture diagram. Labels: HDFS Client, NameNode, NameNode Standby, Snapshot Node, Journal Nodes, Zookeeper, DataNodes]

Active-Standby Replication of NN Log

Agreement on the Active NameNode

Faster Recovery - Cut the NN Log

Doesn’t Scale Out

Page 39: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

YARN Architecture

[Architecture diagram. Labels: YARN Client, ResourceMgr, ResourceMgr Standby, Zookeeper, NodeManagers]

1. Master-Slave Replication of RM State

2. Agreement on the Active ResourceMgr