Strata Hadoop Hopsworks

HopsWorks: Multi-Tenant Hadoop-as-a-Service. Jim Dowling, Associate Prof @ KTH Stockholm, Senior Researcher @ SICS, CEO @ Logical Clocks AB. Hadoop Strata, London, June 3rd 2016. www.hops.io @hopshadoop

Page 1: Strata Hadoop Hopsworks

HopsWorks: Multi-Tenant Hadoop-as-a-Service

Jim Dowling Associate Prof @ KTH Stockholm

Senior Researcher @ SICS

CEO @ Logical Clocks AB

Hadoop Strata, London, June 3rd 2016

www.hops.io @hopshadoop

Page 2: Strata Hadoop Hopsworks

Open, Collaborative Software Development

Page 3: Strata Hadoop Hopsworks

Hadoop Administrator

Enterprise-Level Hadoop: Admin-in-the-Loop

Page 4: Strata Hadoop Hopsworks

Access Control in Hadoop: Apache Ranger

>hdfs dfs -chmod -R 000 /apps/hive

[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]

Page 5: Strata Hadoop Hopsworks

Access Control in Hadoop: Apache Sentry

How do you ensure the consistency of the policies and the data?

[Mujumdar’15]

Page 6: Strata Hadoop Hopsworks

Access Control in Relational Databases

# Multi-tenancy for alice and bob on db1 and db2
grant all privileges on db1.* to 'alice'@'%';
grant all privileges on db2.* to 'bob'@'%';

Consistency of security and privileges guaranteed with foreign keys.

drop db2; // deletes associated privileges
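The foreign-key argument on this slide can be tried out in miniature. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables, `databases` and `privileges`; neither the table names nor the schema come from the talk, and this is not how any particular database server stores its grants. It only illustrates that a cascading foreign key keeps privileges consistent with the data they refer to.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog with two databases and one grant each."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE databases (name TEXT PRIMARY KEY);
        CREATE TABLE privileges (
            db_name TEXT NOT NULL REFERENCES databases(name) ON DELETE CASCADE,
            user    TEXT NOT NULL,
            PRIMARY KEY (db_name, user)
        );
        INSERT INTO databases VALUES ('db1'), ('db2');
        INSERT INTO privileges VALUES ('db1', 'alice'), ('db2', 'bob');
        """
    )
    return conn


def drop_database(conn: sqlite3.Connection, name: str) -> None:
    """Remove a database row; the foreign key removes its privileges too."""
    conn.execute("DELETE FROM databases WHERE name = ?", (name,))
    conn.commit()


def privileges(conn: sqlite3.Connection) -> list:
    """Return all remaining (db_name, user) grants in a stable order."""
    query = "SELECT db_name, user FROM privileges ORDER BY db_name, user"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    catalog = build_catalog()
    drop_database(catalog, "db2")
    print(privileges(catalog))  # bob's grant on db2 is gone
```

Because the privilege row cannot outlive the database row it references, there is no window in which a policy points at data that no longer exists.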

Page 7: Strata Hadoop Hopsworks

Metadata Totem Poles in Hadoop

Eventual Consistency

Page 8: Strata Hadoop Hopsworks

Why the separation of Metadata and Data?

Page 9: Strata Hadoop Hopsworks

HDFS v2

DataNodes

HDFS Client

Journal Nodes

ZooKeeper Nodes

NameNode

Standby NameNode

Max 200 GB

Page 10: Strata Hadoop Hopsworks

YARN

NodeManagers

YARN Client

ZooKeeper Nodes

ResourceMgr

Standby ResourceMgr

Tinker-Friendly?

Page 11: Strata Hadoop Hopsworks

There is another way…

Page 12: Strata Hadoop Hopsworks

Hops: Distributed Metadata for Hadoop

Page 13: Strata Hadoop Hopsworks

HopsFS Architecture

NameNodes

NDB

Leader

HDFS Client

DataNodes

> 12 TB

> 2.6x Throughput

Page 14: Strata Hadoop Hopsworks

HopsYARN Architecture

ResourceMgrs

NDB

Scheduler

YARN Client

NodeManagers

Resource Trackers

Leader Election for Failed Scheduler

Up to 10K Node Clusters

Page 15: Strata Hadoop Hopsworks

Challenges for Project-Level Multi-Tenancy

(How can we introduce GitHub-style projects to Hadoop?)

Page 16: Strata Hadoop Hopsworks

Problem: Sensitive Data needs its own Cluster

NSA DataSet

User DataSet

has access to

has access to

Copy/cross-link between data sets

Alice has only one Kerberos Identity. Neither attribute-based access control nor dynamic roles are supported in Hadoop.

Alice

Page 17: Strata Hadoop Hopsworks

Solution: Project-Specific UserIDs

Project NSA

Project Users

Member of

NSA__Alice

Users__Alice

Member of

HDFS enforces access control

How can we share DataSets between Projects?
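The project-specific identities on this slide (NSA__Alice, Users__Alice) follow a simple naming pattern: project name, a double underscore, then the user. A minimal sketch of that naming scheme might look as follows. The helper names `project_user` and `split_project_user` are hypothetical and are not taken from the HopsWorks code base, and the rule that names may not themselves contain the separator is an assumption made here to keep the mapping reversible.

```python
from typing import Tuple

SEPARATOR = "__"  # the slides show identities such as NSA__Alice


def project_user(project: str, user: str) -> str:
    """Build the per-project identity for a user, e.g. NSA__Alice."""
    if not project or not user:
        raise ValueError("project and user must be non-empty")
    if SEPARATOR in project or SEPARATOR in user:
        # Assumption: forbid the separator in names so the identity can be split again.
        raise ValueError("names must not contain the separator")
    return f"{project}{SEPARATOR}{user}"


def split_project_user(identity: str) -> Tuple[str, str]:
    """Recover (project, user) from a per-project identity."""
    project, sep, user = identity.partition(SEPARATOR)
    if not sep or not project or not user:
        raise ValueError(f"not a project-specific identity: {identity!r}")
    return project, user


if __name__ == "__main__":
    for project in ("NSA", "Users"):
        identity = project_user(project, "Alice")
        print(identity, split_project_user(identity))
```

Since Alice appears to HDFS as a different user in each project, ordinary file ownership and permissions are enough to keep the two projects' data apart.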

Page 18: Strata Hadoop Hopsworks

Sharing Data with First-Class DataSets

Project NSA

Project Users

Member of

DataSet

owns

Add members of Project NSA to the DataSet group

NSA__Alice

Users__Alice

Member of
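Sharing a DataSet, as drawn on this slide, amounts to adding the other project's members to the DataSet's group. The following sketch models that with plain dictionaries. The class `AccessModel`, its method names, and the group naming `owner__dataset` are all hypothetical choices for illustration; the real system delegates these checks to HDFS groups and permissions.

```python
class AccessModel:
    """Toy model of projects, DataSet groups, and group-based access checks."""

    def __init__(self) -> None:
        self.group_members = {}  # group name -> set of project-specific user IDs
        self.dataset_group = {}  # DataSet name -> group that may access it

    def add_project(self, project: str, users: list) -> None:
        """Register a project and create one identity per member."""
        self.group_members[project] = {f"{project}__{user}" for user in users}

    def add_dataset(self, owner_project: str, dataset: str) -> None:
        """Create a DataSet group seeded with the owning project's members."""
        group = f"{owner_project}__{dataset}"  # assumed naming, for illustration only
        self.group_members[group] = set(self.group_members[owner_project])
        self.dataset_group[dataset] = group

    def share(self, dataset: str, with_project: str) -> None:
        """Add the members of another project to the DataSet's group."""
        group = self.dataset_group[dataset]
        self.group_members[group] |= self.group_members[with_project]

    def can_access(self, identity: str, dataset: str) -> bool:
        """Check group membership, as a file system would for a group permission."""
        return identity in self.group_members[self.dataset_group[dataset]]


if __name__ == "__main__":
    model = AccessModel()
    model.add_project("Users", ["Alice"])
    model.add_project("NSA", ["Alice"])
    model.add_dataset("Users", "DataSet")
    print(model.can_access("NSA__Alice", "DataSet"))  # before sharing
    model.share("DataSet", "NSA")
    print(model.can_access("NSA__Alice", "DataSet"))  # after sharing
```

No data is copied in this model; only group membership changes, which is what makes the DataSet a first-class, shareable object.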

Page 19: Strata Hadoop Hopsworks

HopsWorks: Project-Level Multi-Tenancy

Page 20: Strata Hadoop Hopsworks

System User Authentication Provider

- JDBC Realm
- 2-Factor Authentication
- LDAP

Page 21: Strata Hadoop Hopsworks

HopsWorks Enforces Dynamic Roles

[email protected]

NSA__Alice

Authenticate

Users__Alice

HopsWorks

HopsFS

HopsYARN

Projects

Secure Impersonation

Kafka

X.509 Certificates

Page 22: Strata Hadoop Hopsworks

X.509 Certificate Per Project-Specific User

[email protected]

Authenticate

Add/Del Users

Distributed Database

Insert/Remove Certs

Project Mgr

Root CA

Services: Hadoop, Spark, Kafka, etc.

Cert Signing Requests

Page 23: Strata Hadoop Hopsworks

Project

A project has an owner

A project is a collection of

- Members
- HDFS DataSets
- Kafka Topics
- Notebooks and Jobs

A project has quotas

project

dataset 1

dataset N

Topic 1

Topic N

Kafka

HDFS

Page 24: Strata Hadoop Hopsworks

Project Roles

Data Owner Privileges

- Import/Export data
- Manage Membership
- Share DataSets, Topics

Data Scientist Privileges

- Write and Run code

We delegate administration of privileges to users just like GitHub

Page 25: Strata Hadoop Hopsworks

Elastic Hadoop

Each Project has:

- YARN CPU Quota (in mins)
- HDFS Storage Quota (in GB/TB)

Uber-Style Pricing
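The two per-project quotas listed above can be sketched as a small accounting object. Everything in this example is an assumption made for illustration: the class `ProjectQuota`, its field names, and in particular the reading of "Uber-style pricing" as a load-dependent multiplier applied to CPU minutes. The slides do not specify how pricing is actually computed.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised when a project would exceed one of its quotas."""


@dataclass
class ProjectQuota:
    yarn_cpu_minutes: float    # remaining YARN CPU quota, in minutes
    hdfs_storage_gb: float     # total HDFS storage quota, in GB
    hdfs_used_gb: float = 0.0  # storage consumed so far, in GB

    def charge_cpu(self, minutes: float, price_multiplier: float = 1.0) -> float:
        """Deduct CPU minutes, scaled by an assumed load-dependent multiplier."""
        cost = minutes * price_multiplier
        if cost > self.yarn_cpu_minutes:
            raise QuotaExceeded("not enough YARN CPU quota left")
        self.yarn_cpu_minutes -= cost
        return self.yarn_cpu_minutes

    def store(self, size_gb: float) -> None:
        """Account for newly written data, refusing writes beyond the quota."""
        if self.hdfs_used_gb + size_gb > self.hdfs_storage_gb:
            raise QuotaExceeded("HDFS storage quota exceeded")
        self.hdfs_used_gb += size_gb


if __name__ == "__main__":
    quota = ProjectQuota(yarn_cpu_minutes=100.0, hdfs_storage_gb=10.0)
    # A 10-minute job run while the cluster is busy (multiplier 2.0) costs 20 minutes.
    print(quota.charge_cpu(10.0, price_multiplier=2.0))
```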

Page 26: Strata Hadoop Hopsworks

Sharing DataSets/Topics between Projects

The same as Sharing Folders in Dropbox

Page 27: Strata Hadoop Hopsworks

Delegate Access Control to HDFS

HDFS enforces access control

- UserID per Project
- GroupID per Project and DataSet

Metadata Integrity using Foreign Keys

- Removing a project removes all users, groups, and (optionally) DataSets

Page 28: Strata Hadoop Hopsworks

Delegate Access Control to Kafka

Kafka brokers enforce access control with certificates

Principal name extracted from the X.509 Certificate is: projectName__userID

HopsAuthorizer enforces ACLs for the topic

ACLs are stored in the distributed database
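The authorization flow described on this slide can be sketched in a few lines: take the principal from the certificate, split it into project and user, and look the result up in the topic's ACLs. The sketch below is not the HopsAuthorizer implementation. It assumes the principal is carried in the CN field of the certificate subject, uses a naive comma split that does not handle escaped characters in a distinguished name, and stands in for the distributed database with an in-memory dictionary. The function names are hypothetical.

```python
def principal_from_subject(subject_dn: str) -> str:
    """Return the CN value from a simple 'CN=...,O=...' subject string."""
    for part in subject_dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN" and value:
            return value
    raise ValueError(f"no CN found in subject: {subject_dn!r}")


def is_authorized(subject_dn: str, topic: str, operation: str, acls: dict) -> bool:
    """Check a (project, user, operation) triple against the ACLs of a topic.

    `acls` maps a topic name to a set of allowed (project, user, operation)
    triples; it is a stand-in for ACL rows kept in the distributed database.
    """
    principal = principal_from_subject(subject_dn)
    project, sep, user = principal.partition("__")
    if not sep or not project or not user:
        return False  # not of the form projectName__userID
    return (project, user, operation) in acls.get(topic, set())


if __name__ == "__main__":
    topic_acls = {"topic1": {("NSA", "Alice", "read")}}
    print(is_authorized("CN=NSA__Alice,O=Hops", "topic1", "read", topic_acls))
    print(is_authorized("CN=Users__Alice,O=Hops", "topic1", "read", topic_acls))
```

Because the project name is part of the principal, the same person is authorized separately for each project she belongs to.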

Page 29: Strata Hadoop Hopsworks

Free Text Search for Metadata

Free-Text Search

Distributed Database

Elasticsearch

The Distributed Database is the Single Source of Truth. Zero overhead, streaming API synchronizes with Elasticsearch.

MetaDataDesigner

MetaDataEntry

Page 30: Strata Hadoop Hopsworks

Design your own Extended Metadata

Page 31: Strata Hadoop Hopsworks

Zeppelin (with multi-tenancy and Livy/Spark)

Page 32: Strata Hadoop Hopsworks

Automated Installation

Vagrant/Chef to spin up on a single host

Karamel/Chef to deploy on AWS/GCE/OpenStack or on-premises

name: HopsWorks
ec2:
    type: m3.medium
cookbooks:
    hadoop:
        github: "hopshadoop/hopsworks-chef"
        version: "v0.1"
groups:
    ui:
        size: 1
        recipes:
            - hopsworks
    metadata:
        size: 2
        recipes:
            - hops::nn
            - hops::rm
    datanodes:
        size: 50
        recipes:
            - hops::dn
            - hops::nm

Page 33: Strata Hadoop Hopsworks

www.hops.site

A 2 MW datacenter research and test environment

5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters

[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]

Page 34: Strata Hadoop Hopsworks

Demo

Page 35: Strata Hadoop Hopsworks

Summing Up

HopsWorks is providing a world’s first:

- Multi-Tenant Hadoop-as-a-Service
- Open-Source Self-service
- Tinker Friendly

www.hops.io @hopshadoop

Page 36: Strata Hadoop Hopsworks

The Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremeskel, Theofilos Kakantousis, Johan Svedlund Nordström, Someya Sayeh, Vasileios Giannokostas, Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan, Ahmad Al-Shishtawy, Ali Gholami, Paul Mälzer.

Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Hops