Strata Hadoop Hopsworks
Transcript of Strata Hadoop Hopsworks
HopsWorks: Multi-Tenant Hadoop-as-a-Service
Jim Dowling
Associate Prof @ KTH Stockholm
Senior Researcher @ SICS
CEO @ Logical Clocks AB
Hadoop Strata, London, June 3rd 2016
www.hops.io @hopshadoop
Open, Collaborative Software Development
Hadoop Administrator
Enterprise-Level Hadoop: Admin-in-the-Loop
Access Control in Hadoop: Apache Ranger
>hdfs dfs -chmod -R 000 /apps/hive
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]
Access Control in Hadoop: Apache Sentry
How do you ensure the consistency of the policies and the data?
[Mujumdar’15]
Access Control in Relational Databases
# Multi-tenancy for alice and bob on db1 and db2
grant all privileges on db1.* to 'alice'@'%';
grant all privileges on db2.* to 'bob'@'%';
Consistency of security and privileges guaranteed with foreign keys.
drop db2; // deletes associated privileges
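The foreign-key guarantee on this slide can be tried out in a few lines. What follows is a minimal sketch, not the schema of any real database server: the tables `databases` and `privileges` are hypothetical, and SQLite is used only because it ships with Python. It shows that removing a database row also removes the privileges that reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical catalog tables, for illustration only.
    CREATE TABLE databases (name TEXT PRIMARY KEY);
    CREATE TABLE privileges (
        db_name  TEXT NOT NULL REFERENCES databases(name) ON DELETE CASCADE,
        username TEXT NOT NULL,
        PRIMARY KEY (db_name, username)
    );
    INSERT INTO databases VALUES ('db1'), ('db2');
    INSERT INTO privileges VALUES ('db1', 'alice'), ('db2', 'bob');
""")


def grants(connection):
    """Return all (database, user) privilege rows, sorted by database name."""
    query = "SELECT db_name, username FROM privileges ORDER BY db_name"
    return connection.execute(query).fetchall()


grants_before = grants(conn)
# Deleting the database row cascades to every privilege that points at it.
conn.execute("DELETE FROM databases WHERE name = 'db2'")
grants_after = grants(conn)

print(grants_before)
print(grants_after)
```

Because the privilege rows cannot outlive the row they reference, no separate clean-up step is needed to keep policies and data consistent.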
Metadata Totem Poles in Hadoop
Eventual Consistency
Why the separation of Metadata and Data?
HDFS v2
DataNodes
HDFS Client
Journal Nodes Zookeeper Nodes
NameNode StandbyNameNode
Max 200 GB
YARN
NodeManagers
YARN Client
Zookeeper Nodes
ResourceMgr StandbyResourceMgr
Tinker-Friendly?
There is another way…..
Hops: Distributed Metadata for Hadoop
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
> 12 TB
> 2.6X Throughput
HopsYARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Leader Election for Failed Scheduler
Up to 10K Node Clusters
Challenges for Project-Level Multi-Tenancy
(How can we introduce GitHub-style projects to Hadoop?)
Problem: Sensitive Data needs its own Cluster
NSA DataSet
User DataSet
has access to
has access to
Copy/cross-link between data sets
Alice has only one Kerberos Identity. Neither attribute-based access control nor dynamic roles are supported in Hadoop.
Alice
Solution: Project-Specific UserIDs
Project NSA
Project Users
Member of
NSA__Alice
Users__Alice
Member of
HDFS enforces access control
How can we share DataSets between Projects?
Sharing Data with First-Class DataSets
Project NSA
Project Users
Member of
DataSet
owns
Add members of Project NSA to the DataSet group
NSA__Alice
Users__Alice
Member of
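The two slides above (project-specific user IDs, then sharing a DataSet by adding another project's members to its group) can be modelled in a few lines. This is a sketch under stated assumptions, not the HopsWorks implementation: the `Cluster` and `DataSet` classes and their method names are hypothetical, and only the `Project__User` naming is taken from the slides.

```python
from dataclasses import dataclass, field

SEPARATOR = "__"  # matches the NSA__Alice / Users__Alice naming on the slides


def project_user(project: str, user: str) -> str:
    """Build the project-specific user ID for one user in one project."""
    return f"{project}{SEPARATOR}{user}"


@dataclass
class DataSet:
    name: str
    owner_project: str
    group: set = field(default_factory=set)  # project-specific IDs with access


class Cluster:
    """Hypothetical in-memory model of projects, members and DataSet groups."""

    def __init__(self):
        self.members = {}   # project name -> set of project-specific user IDs
        self.datasets = {}  # DataSet name -> DataSet

    def add_member(self, project, user):
        uid = project_user(project, user)
        self.members.setdefault(project, set()).add(uid)
        return uid

    def create_dataset(self, project, name):
        # The owning project's current members form the initial DataSet group.
        group = set(self.members.get(project, set()))
        self.datasets[name] = DataSet(name, project, group)

    def share(self, name, target_project):
        # Sharing adds the target project's members to the DataSet group.
        self.datasets[name].group |= self.members.get(target_project, set())

    def can_access(self, uid, name):
        return uid in self.datasets[name].group


cluster = Cluster()
nsa_alice = cluster.add_member("NSA", "Alice")
users_alice = cluster.add_member("Users", "Alice")
cluster.create_dataset("NSA", "NSADataSet")
cluster.create_dataset("Users", "UserDataSet")

access_before_share = cluster.can_access(nsa_alice, "UserDataSet")
cluster.share("UserDataSet", "NSA")
access_after_share = cluster.can_access(nsa_alice, "UserDataSet")
users_alice_sees_nsa = cluster.can_access(users_alice, "NSADataSet")
```

Alice holds two unrelated identities here, so data she can read as `NSA__Alice` is not reachable as `Users__Alice` unless a DataSet is explicitly shared in that direction.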
HopsWorks: Project-Level Multi-Tenancy
System User
Authentication Provider
- JDBC Realm
- 2-Factor Authentication
- LDAP
HopsWorks Enforces Dynamic Roles
NSA__Alice
Authenticate
Users__Alice
HopsWorks
HopsFS
HopsYARN
Projects
Secure Impersonation
Kafka
X.509 Certificates
X.509 Certificate Per Project-Specific User
Authenticate
Add/Del Users
Distributed Database
Insert/Remove Certs
Project Mgr
Root CA
Services: Hadoop, Spark, Kafka, etc.
Cert Signing Requests
Project
A project has an owner
A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks and Jobs
A project has quotas
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS
Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users, just like GitHub
Elastic Hadoop
Each Project has:
- YARN CPU Quota (in mins)
- HDFS Storage Quota (in GB/TB)
Uber-Style Pricing
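Per-project quotas of this kind could be accounted for roughly as follows. This is a sketch only: the `ProjectQuota` class, its field names, and the reading of "Uber-style pricing" as a load-dependent price multiplier are all assumptions, not details given in the talk.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised when a project does not have enough quota left."""


@dataclass
class ProjectQuota:
    yarn_cpu_minutes: float   # remaining YARN CPU quota, in minutes
    hdfs_storage_gb: float    # total HDFS storage quota, in GB
    hdfs_used_gb: float = 0.0

    def charge_job(self, containers, minutes, price_multiplier=1.0):
        # Assumption: the multiplier rises when the cluster is busy, so the
        # same job consumes more quota at peak times.
        cost = containers * minutes * price_multiplier
        if cost > self.yarn_cpu_minutes:
            raise QuotaExceeded(f"job needs {cost} CPU minutes")
        self.yarn_cpu_minutes -= cost
        return cost

    def store(self, gigabytes):
        if self.hdfs_used_gb + gigabytes > self.hdfs_storage_gb:
            raise QuotaExceeded(f"cannot store another {gigabytes} GB")
        self.hdfs_used_gb += gigabytes


quota = ProjectQuota(yarn_cpu_minutes=100, hdfs_storage_gb=50)
job_cost = quota.charge_job(containers=4, minutes=10, price_multiplier=1.5)
quota.store(20)

try:
    quota.store(40)  # 20 GB + 40 GB would exceed the 50 GB quota
    storage_rejected = False
except QuotaExceeded:
    storage_rejected = True
```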
Sharing DataSets/Topics between Projects
The same as Sharing Folders in Dropbox
Delegate Access Control to HDFS
HDFS enforces access control
- UserID per Project
- GroupID per Project and DataSet
Metadata Integrity using Foreign Keys
- Removing a project removes all users, groups, and (optionally) DataSets
Delegate Access Control to Kafka
Kafka brokers enforce access control with certificates
Principal name extracted from the X.509 Certificate is: projectName__userID
HopsAuthorizer enforces ACLs for the topic
ACLs are stored in the distributed database
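The authorization flow on this slide (extract the principal from the certificate, split it into project and user, look up the topic ACL) might look like the sketch below. It is not the HopsAuthorizer code and does not use the Kafka authorizer interface: `parse_principal`, `authorize` and the `TOPIC_ACLS` dictionary are hypothetical, and the dictionary stands in for ACL rows held in the distributed database.

```python
SEPARATOR = "__"


def parse_principal(principal: str):
    """Split 'projectName__userID' into (project, user).

    Assumption: project names never contain the separator, so the first
    occurrence marks the boundary.
    """
    project, sep, user = principal.partition(SEPARATOR)
    if not sep or not project or not user:
        raise ValueError(f"not a project-specific principal: {principal!r}")
    return project, user


# Hypothetical ACL table: topic -> (project, user) -> allowed operations.
TOPIC_ACLS = {
    "sensor-readings": {
        ("NSA", "Alice"): {"read", "write"},
        ("Users", "Bob"): {"read"},
    },
}


def authorize(principal, topic, operation, acls=TOPIC_ACLS):
    """Return True only if the ACL table grants the operation on the topic."""
    try:
        key = parse_principal(principal)
    except ValueError:
        return False  # principals without a project prefix are always denied
    return operation in acls.get(topic, {}).get(key, set())


alice_can_write = authorize("NSA__Alice", "sensor-readings", "write")
bob_can_write = authorize("Users__Bob", "sensor-readings", "write")
plain_user_can_read = authorize("Alice", "sensor-readings", "read")
```

Denying by default means a topic with no ACL entry, or a principal that does not follow the naming scheme, never gets access by accident.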
Free Text Search for Metadata
Free-Text Search
Distributed Database
Elasticsearch
The Distributed Database is the Single Source of Truth.
Zero overhead, streaming API synchronizes with Elasticsearch.
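The single-source-of-truth pattern described here can be illustrated with an in-memory stand-in. This sketch uses neither the Elasticsearch client nor the database's streaming API: `MetadataStore`, `SearchIndex` and `sync` are hypothetical names. It only shows the idea that every write goes to the store, and the search index is rebuilt purely by replaying the store's ordered change events.

```python
from collections import defaultdict


class MetadataStore:
    """Stand-in for the distributed database: rows plus an ordered change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # (operation, row id, text) in commit order

    def upsert(self, row_id, text):
        self.rows[row_id] = text
        self.log.append(("upsert", row_id, text))

    def delete(self, row_id):
        del self.rows[row_id]
        self.log.append(("delete", row_id, ""))


class SearchIndex:
    """Stand-in for the search cluster: a tiny inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> row ids containing it
        self.docs = {}

    def apply(self, event):
        operation, row_id, text = event
        # Drop the old version first so updates and deletes leave no stale hits.
        for token in self.docs.pop(row_id, "").lower().split():
            self.postings[token].discard(row_id)
        if operation == "upsert":
            self.docs[row_id] = text
            for token in text.lower().split():
                self.postings[token].add(row_id)

    def search(self, query):
        tokens = query.lower().split()
        if not tokens:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in tokens))


def sync(store, index, offset=0):
    """Replay change events from `offset` onward; return the new offset."""
    for event in store.log[offset:]:
        index.apply(event)
    return len(store.log)


store, index = MetadataStore(), SearchIndex()
store.upsert(1, "genome samples 2016")
store.upsert(2, "kafka sensor samples")
offset = sync(store, index)
hits_before_delete = index.search("samples")

store.delete(1)
offset = sync(store, index, offset)
hits_after_delete = index.search("samples")
```

Since the index is never written to directly, it can always be discarded and rebuilt from the store's log.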
MetaDataDesigner
MetaDataEntry
Design your own Extended Metadata
Zeppelin (with multi-tenancy and Livy/Spark)
Automated Installation
Vagrant/Chef to spin up on a single host
Karamel/Chef to deploy on AWS/GCE/OpenStack or on-premises
name: HopsWorks
ec2:
  type: m3.medium
cookbooks:
  hadoop:
    github: "hopshadoop/hopsworks-chef"
    version: "v0.1"
groups:
  ui:
    size: 1
    recipes:
      - hopsworks
  metadata:
    size: 2
    recipes:
      - hops::nn
      - hops::rm
  datanodes:
    size: 50
    recipes:
      - hops::dn
      - hops::nm
www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
Summing Up
HopsWorks is providing a world’s first:
Multi-Tenant Hadoop-as-a-Service
Open-Source
Self-service
Tinker Friendly
www.hops.io @hopshadoop
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremeskel, Theofilos Kakantousis, Johan Svedlund Nordström, Someya Sayeh, Vasileios Giannokostas, Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan, Ahmad Al-Shishtawy, Ali Gholami, Paul Mälzer.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Hops