
Page 1: Hopsworks: Secure Spark/Flink/Kafka-as-a-Service. Source: bada.sics.se/wp-content/uploads/2017/12/HopsWorks.pdf (2017-12-22). Berlin Buzzwords, Hopsworks, J Dowling, June 2017.

Hopsworks: Secure Spark/Flink/Kafka-as-a-Service

Dr. Gautier Berthou Senior Researcher @ SICS

Hops

@hopshadoop

http://github.com/hopshadoop

http://www.hops.io

Page 2:

HopsFS: Next Generation HDFS*

16x Throughput (Faster)

37x Number of files (Bigger)

*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Scale Challenge Winner (2017)

Page 3:

Streaming-as-a-Service in Sweden

• SICS ICE: Datacenter research environment

• Hopsworks: Spark/Flink/Kafka/Tensorflow-as-a-Service
  – Built on HopsFS/YARN
  – >150 active users

Page 4:

I want to Spark Up, all by myself with a piece of Flink.

Self-Service Streaming-as-a-Service

Page 5:

Monitor

Logs

Data Src

Flink/Spark

Pipe

…

Sink

WAL & Checkpoints (S3/HDFS/RocksDB)

Basic Services needed for Stream Processing

Page 6:

HopsFS YARN

Grafana/InfluxDB

Elastic/Kibana

Public Cloud or On-Premise

Parquet

Data Src

Batch Analytics

Flink/Spark

Kafka

…

MySQL

Hopsworks

Page 7:

Ostrich Day: 2018-05-25

http://www.computerweekly.com/news/450295538/D-Day-for-GDPR-is-25-May-2018

General Data Protection Regulation (GDPR)

Page 8:

Manage Projects like GitHub


Page 9:

Share like in Dropbox


Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc

Page 10:

Only Modern Data Parallel Platforms


Page 11:

Workflow/Jobs and Notebook Support


Page 12:

Custom Python Environments with Conda


Python libraries are usable by Spark/Tensorflow

Page 13:

Privacy-by-Design with Projects, Data, Users


Page 14:

Proj-42

Projects for Software-as-a-Service

A Project is a Grouping of Users and Data

Proj-X

Shared Topic

Topic

/Projs/My/Data

Proj-All

CompanyDB

Page 15:

SaaS IoT Data Platform

Sensor Node

Sensor Node

Sensor Node

Field Gateway

Storage

Analysis

Ingestion

ACME

Evil Corp

IoT Infrastructure

Page 16:

Multi-Cloud SaaS IoT Data Platform

ACME DontBeEvil Corp Evil-Corp

AWS, Google Cloud, Oracle Cloud

User Apps control IoT Devices

IoT Data Platform

Field Gateway Field Gateway Field Gateway

Page 17:

IoT Project

Kafka Topic

SaaS IoT Platform: Project per Customer

ACME Project

ACME Topic

ACME HDFS Dataset

Data Stream

Generic Analytics

Shared

Custom Analytics

ACME manages membership

Page 18:

Project Roles

• Data Owner Privileges
  - Import/Export data
  - Manage Membership
  - Share DataSets, Topics

• Data Scientist Privileges
  - Write and Run code

We delegate administration of privileges to users
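
The two roles above can be read as a simple role-to-privilege mapping. The following is a minimal sketch of that idea only, not Hopsworks code: the names `Role`, `PRIVILEGES`, and `is_allowed`, and the privilege strings, are hypothetical labels derived from the slide, and the mapping lists only the privileges the slide names for each role.

```python
from enum import Enum


class Role(Enum):
    DATA_OWNER = "Data Owner"
    DATA_SCIENTIST = "Data Scientist"


# Hypothetical privilege labels, taken from the slide text.
# Only the privileges the slide lists for each role are included.
PRIVILEGES = {
    Role.DATA_OWNER: {
        "import_export_data",
        "manage_membership",
        "share_datasets_topics",
    },
    Role.DATA_SCIENTIST: {
        "write_and_run_code",
    },
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role carries the named privilege."""
    return action in PRIVILEGES[role]
```

With this mapping, `is_allowed(Role.DATA_SCIENTIST, "manage_membership")` is false, which matches the slide's point that membership is administered by the project's Data Owners rather than by a central administrator.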


Page 19:

Dynamic Roles for Hadoop


[email protected]

Authenticate

Hopsworks

projX__Alice

proj42__Alice

1. Alice has a new HDFS user per Project (ProjectUser): projX__Alice, proj42__Alice

2. Each ProjectUser has its own SSL/TLS cert.
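
The per-project identity scheme above can be illustrated with a short sketch. It assumes only what the slide shows, namely that a ProjectUser name is the project name and the user name joined by a double underscore; the helper names `project_user` and `split_project_user` are hypothetical and are not taken from the Hopsworks code base.

```python
SEPARATOR = "__"  # double underscore, as in projX__Alice on the slide


def project_user(project: str, user: str) -> str:
    """Build the per-project HDFS user name for a user."""
    if SEPARATOR in project or SEPARATOR in user:
        # Assumption of this sketch: reject names that would make
        # the combined name ambiguous to split again.
        raise ValueError("names must not contain the separator")
    return f"{project}{SEPARATOR}{user}"


def split_project_user(name: str) -> tuple[str, str]:
    """Recover (project, user) from a ProjectUser name."""
    project, sep, user = name.partition(SEPARATOR)
    if not sep or not project or not user:
        raise ValueError(f"not a ProjectUser name: {name!r}")
    return project, user
```

Because Alice acts as a different HDFS user in each project, permissions granted to `projX__Alice` say nothing about what `proj42__Alice` may access, which is what makes the role dynamic per project.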

Page 20:

Project Quotas

• Per-Project quotas
  – Storage in HDFS
  – CPU in YARN (Uber-style Pricing)

• Sharing is not Copying
  – Datasets/Topics
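
A small model can make "Sharing is not Copying" concrete. This is a sketch under stated assumptions, not the Hopsworks implementation: the classes `Dataset` and `Project` and the function `share` are hypothetical, and charging a shared dataset only to the project that owns it is an assumption of the sketch that the slide does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    size_bytes: int
    owner: str  # name of the owning project


@dataclass
class Project:
    name: str
    storage_quota_bytes: int
    owned: list = field(default_factory=list)      # datasets stored by this project
    shared_in: list = field(default_factory=list)  # references to other projects' datasets

    def used_bytes(self) -> int:
        # Assumption: only owned datasets count against the storage quota.
        return sum(d.size_bytes for d in self.owned)

    def add_dataset(self, name: str, size_bytes: int) -> Dataset:
        if self.used_bytes() + size_bytes > self.storage_quota_bytes:
            raise ValueError("storage quota exceeded")
        dataset = Dataset(name, size_bytes, self.name)
        self.owned.append(dataset)
        return dataset


def share(dataset: Dataset, target: Project) -> None:
    """Share by reference: the target sees the same object, no bytes are copied."""
    target.shared_in.append(dataset)
```

In this model a project with a small quota can still read a large dataset shared into it, because sharing adds a reference and leaves the data stored once.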

Page 21:

• Hops Hadoop
• Apache Kafka
• ELK Stack
• Grafana/InfluxDB
• Jupyter/Apache Zeppelin

Hopsworks Self-Service

Tooling for Streaming

Page 22:

Manage & Share
• Topics
• ACLs
• Avro Schemas

Kafka Self-Service UI

Page 23:

http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/

Realtime Logs

• YARN aggregates logs on job completion
  - No good to us for Streaming

• Collect logs and make them searchable in real-time using Logstash, Elasticsearch, and Kibana
  - Log4j auto-configured to write to Logstash

Page 24:

Elasticsearch, Logstash, Kibana (ELK Stack)

Realtime Logs

Page 25:

Graphite/InfluxDB and Grafana

Resource Monitoring/Alerting

Page 26:

Zeppelin Notebooks


Page 27:

Jupyter Notebooks


Page 28:

Hops Roadmap


Page 29:


Dela – A Global Ecosystem for Datasets


Page 30:

Distributed Tensorflow on YARN


1. GPU-as-a-resource in Hops-YARN

2. Tensorflow-on-Spark

3. Native Tensorflow-on-YARN with Infiniband Support

https://github.com/hopshadoop/hops-tensorflow

Page 31:

30 namenodes/datanodes and 6 NDB nodes were used. Small file size was 4 KB. HopsFS files were stored on Intel 750 Series SSDs.

HopsFS Small Files Performance (Early Results)


Page 32:

Multi-Data-Center HopsFS


NDB NDB

DN DN DN DN

Client

Synchronous Replication of Blocks

Network Partition Identification Service

NN NN NN NN

Asynchronous Multi-Master Replication of Metadata

Hops-eu-west1 Hops-eu-west2

Page 33:

Hive Metastore is Moving in with HopsFS


HopsFS

Hive MetaStore

Page 34:

Hive Metastore is Moving in with HopsFS


HopsFS Hive MetaStore

Hive MetaStore

Page 35:

Strongly Consistent Hive Metadata


Removing the HDFS backing directory removes the Table from the Hive Metastore

Page 36:

Summary

• Europe’s Only Hadoop Distribution – Hops Hadoop
  - Fully Open-Source

• Hops supports larger/faster Hadoop Clusters
  - More scalable, tinker-friendly, and fully open-source.

• Hopsworks is a new Data Platform built on HopsFS with first-class support for Streaming
  - Spark or Flink

Page 37:

The Hops Team

Active:

Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Zahin Azher Rashid, Robin Andersson, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Filotas Siskos.

Alumni:

Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Hops Heads

Page 38:

Thank You.

Follow us: @hopshadoop

Star us: http://github.com/hopshadoop/hopsworks

Join us: http://www.hops.io


Hops