Spark Summit EU talk by Jim Dowling

SPARK SUMMIT EUROPE 2016 On Premise Spark-as-a-Service on YARN Jim Dowling Associate Prof @ KTH, Stockholm Senior Researcher, SICS Swedish ICT CEO, Logical Clocks AB Twitter: @jim_dowling

Transcript of Spark Summit EU talk by Jim Dowling

Page 1: Spark Summit EU talk by Jim Dowling

SPARK SUMMIT EUROPE 2016

On Premise Spark-as-a-Service on YARN
Jim Dowling
Associate Prof @ KTH, Stockholm
Senior Researcher, SICS Swedish ICT
CEO, Logical Clocks AB
Twitter: @jim_dowling

Page 2: Spark Summit EU talk by Jim Dowling

Spark-as-a-Service in Sweden
• SICS ICE: datacenter research and test environment
• Hopsworks: Spark/Kafka/Flink/Hadoop-as-a-service
  – Built on Hops Hadoop (www.hops.io)
  – Over 100 active users
  – Spark the platform of choice

Page 3: Spark Summit EU talk by Jim Dowling

HopsFS Architecture

[Diagram labels: HDFS Client; NameNodes; Leader; NDB; DataNodes]

Page 4: Spark Summit EU talk by Jim Dowling

Hops-YARN Architecture

[Diagram labels: YARN Client; ResourceMgrs; Scheduler; Resource Trackers; NDB; NodeManagers; Heartbeats (70-95%); AM Reqs (5-30%)]

Page 5: Spark Summit EU talk by Jim Dowling

Pluggable DB: Data Abstraction Layer

[Diagram labels: NameNode (Apache v2); DAL API (Apache v2); NDB-DAL-Impl (GPL v2); Other DB (Other License)]

Jars: hops-2.7.3.jar, dal-ndb-2.7.3-7.5.4.jar
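The layering on this slide (core code that depends only on a DAL API, with the database-specific implementation shipped in a separate jar) can be illustrated with a plain Java interface and an implementation loaded by class name. This is a minimal sketch of the general pattern, not the Hops DAL API: `MetadataStore`, `InMemoryStore`, and `load` are hypothetical names invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names come from the Hops code base.
public class DalSketch {

    /** Stands in for the DAL API: the only type the caller compiles against. */
    public interface MetadataStore {
        void put(String key, String value);
        String get(String key);
    }

    /** Stands in for a pluggable implementation that could live in its own jar. */
    public static class InMemoryStore implements MetadataStore {
        private final Map<String, String> rows = new ConcurrentHashMap<>();

        @Override
        public void put(String key, String value) {
            rows.put(key, value);
        }

        @Override
        public String get(String key) {
            return rows.get(key);
        }
    }

    /** Loads an implementation by class name, so it can be chosen by configuration. */
    public static MetadataStore load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return (MetadataStore) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In a real deployment the class name would come from a config file.
        MetadataStore store = load(InMemoryStore.class.getName());
        store.put("/user/alice", "inode-42");
        System.out.println(store.get("/user/alice"));
    }
}
```

Because the caller only references the interface, an implementation under a different license can be swapped in without recompiling the calling code, which appears to be the point of the split into two jars.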

Page 6: Spark Summit EU talk by Jim Dowling

HopsFS Throughput vs Apache HDFS

NDB setup: nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.

Page 7: Spark Summit EU talk by Jim Dowling

HOPSWORKS

Page 8: Spark Summit EU talk by Jim Dowling

Project-Based Multi-Tenancy
• A project is a collection of
  – Users with Roles
  – HDFS DataSets
  – Kafka Topics
  – Notebooks, Jobs
• Per-Project quotas
  – Storage in HDFS
  – CPU in YARN
• Uber-style Pricing
• Sharing across Projects
  – Datasets/Topics

[Diagram labels: project; dataset 1 … dataset N (HDFS); Topic 1 … Topic N (Kafka)]
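The project abstraction listed above (members with roles, datasets, topics, a per-project storage quota, and sharing across projects) can be pictured as a small data structure. The following is a hypothetical model for illustration only, not the Hopsworks data model; the role names, field names, and the `tryWrite` and `shareDataset` methods are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration; not the Hopsworks data model.
public class ProjectSketch {

    // Role names are assumptions.
    public enum Role { DATA_OWNER, DATA_SCIENTIST }

    public static class Project {
        public final String name;
        public final Map<String, Role> members = new HashMap<>();
        public final Set<String> datasets = new HashSet<>();        // owned by this project
        public final Set<String> sharedDatasets = new HashSet<>();  // shared in from other projects
        public final Set<String> topics = new HashSet<>();
        public final long storageQuotaBytes;
        public long storageUsedBytes = 0;

        public Project(String name, long storageQuotaBytes) {
            this.name = name;
            this.storageQuotaBytes = storageQuotaBytes;
        }

        /** Charges a write against the storage quota; returns false if it would not fit. */
        public boolean tryWrite(long bytes) {
            if (bytes < 0 || bytes > storageQuotaBytes - storageUsedBytes) {
                return false;
            }
            storageUsedBytes += bytes;
            return true;
        }

        /** Makes a dataset owned by this project visible to another project. */
        public void shareDataset(String dataset, Project other) {
            if (!datasets.contains(dataset)) {
                throw new IllegalArgumentException(name + " does not own " + dataset);
            }
            other.sharedDatasets.add(name + "::" + dataset);
        }
    }

    public static void main(String[] args) {
        Project a = new Project("projectA", 100);
        Project b = new Project("projectB", 100);
        a.members.put("alice", Role.DATA_OWNER);
        a.datasets.add("logs");
        a.shareDataset("logs", b);
        System.out.println(a.tryWrite(60) + " " + a.tryWrite(50) + " " + b.sharedDatasets);
    }
}
```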

Page 9: Spark Summit EU talk by Jim Dowling

Dynamic Roles for Hadoop/Kafka

[Diagram labels: user email address (obscured in the source); Authenticate; NSA__Alice; Users__Alice; Projects; Secure Impersonation; HopsFS; HopsYARN; Kafka; X.509 Certificates]

Page 10: Spark Summit EU talk by Jim Dowling

Look Ma, No Kerberos!
• For each project, a user is issued with an X.509 certificate, containing the project-specific userID.
• Inspired by Netflix's BLESS system.
• Services are also issued with X.509 certificates.
  – Both user and service certs are signed with the same CA.
  – Services extract the userID from RPCs to identify the caller.
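To make the last point concrete, here is a minimal sketch of how a service might recover a project-specific userID from a certificate subject. The slides do not say where the userID is stored, so this assumes it is the subject's CN, and it assumes the `project__user` form suggested by labels such as NSA__Alice on the previous slide. The class and method names are hypothetical.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical sketch; assumes the userID is the subject CN in "project__user" form.
public class CertUserIdSketch {

    /** Returns the CN attribute of a subject distinguished name. */
    public static String commonName(String subjectDn) throws InvalidNameException {
        for (Rdn rdn : new LdapName(subjectDn).getRdns()) {
            if ("CN".equalsIgnoreCase(rdn.getType())) {
                return rdn.getValue().toString();
            }
        }
        throw new IllegalArgumentException("no CN in subject: " + subjectDn);
    }

    /** Splits "project__user" into {project, user}; the "__" separator is an assumption. */
    public static String[] splitProjectUser(String userId) {
        int i = userId.indexOf("__");
        if (i <= 0 || i + 2 >= userId.length()) {
            throw new IllegalArgumentException("not a project__user id: " + userId);
        }
        return new String[] { userId.substring(0, i), userId.substring(i + 2) };
    }

    public static void main(String[] args) throws Exception {
        // With a live TLS connection, the DN would come from the peer certificate via
        // X509Certificate.getSubjectX500Principal().getName().
        String dn = "CN=NSA__Alice,O=Example";
        String[] parts = splitProjectUser(commonName(dn));
        System.out.println("project=" + parts[0] + " user=" + parts[1]);
    }
}
```

The identity is only trustworthy once the certificate chain has been validated against the shared CA, which the TLS layer would normally do before this code runs.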

Page 11: Spark Summit EU talk by Jim Dowling

[Diagram labels: user email address (obscured in the source); Add/Del Users; Project Mgr; Insert/Remove Certs; Distributed Database; Root CA; Cert Signing Requests; Project-User Certificates; Services: Hadoop, Spark, Kafka, etc.]

Page 12: Spark Summit EU talk by Jim Dowling

Spark Streaming on YARN with Hopsworks

1. Launch Spark Job
2. Get certs, service endpoints
3. YARN Job, config
4. Materialize certs
5. Read Certs
6. Get Schema
7. Consume/Produce
8. Authenticate

[Diagram labels: user email address (obscured in the source); Hopsworks; Distributed Database; YARN Private LocalResources; Spark Streaming App; KafkaUtil]

Page 13: Spark Summit EU talk by Jim Dowling

Spark Stream Producer in Secure Kafka

SparkConf sparkConf = …
JavaSparkContext jsc = …

1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: producer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely

// write to Kafka

[Side labels: Developer; Operations]
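Step 2 in the list above (a Kafka Properties object carrying certs and broker details) might look roughly like the following. The property keys are standard Kafka client settings for TLS; the broker address, file paths, and password in `main` are placeholders, and the class and method names are hypothetical. The sketch covers only the properties, not discovery, cert distribution, or cleanup.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys are standard Kafka client settings.
public class SecureKafkaProps {

    /** Builds client properties for a TLS connection with client-certificate authentication. */
    public static Properties sslProperties(String brokers, String keystorePath,
                                           String truststorePath, String password) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", brokers);
        p.setProperty("security.protocol", "SSL");
        // The keystore holds the client's X.509 certificate and private key.
        p.setProperty("ssl.keystore.location", keystorePath);
        p.setProperty("ssl.keystore.password", password);
        p.setProperty("ssl.key.password", password);
        // The truststore holds the CA certificate used to verify the brokers.
        p.setProperty("ssl.truststore.location", truststorePath);
        p.setProperty("ssl.truststore.password", password);
        return p;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        Properties p = sslProperties("broker.example.com:9092",
                "/tmp/keystore.jks", "/tmp/truststore.jks", "changeit");
        p.stringPropertyNames().stream().sorted()
                .filter(k -> !k.endsWith("password"))  // avoid printing secrets
                .forEach(k -> System.out.println(k + "=" + p.getProperty(k)));
    }
}
```

A producer or consumer would also need serializer or deserializer settings before these properties could be passed to a Kafka client.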

Page 14: Spark Summit EU talk by Jim Dowling

Spark Streaming Producer in Hopsworks

List<String> topics = KafkaUtil.getTopics();
…
SparkProducer sparkProducer = KafkaUtil.getSparkProducer(topic);
…
Map<String, String> message = …
sparkProducer.produce(message);
…
sparkProducer.close();

https://github.com/hopshadoop/hops-kafka-examples

Page 15: Spark Summit EU talk by Jim Dowling

Spark Streaming Consumer in Hopsworks

JavaStreamingContext jssc = …
List<String> topics = KafkaUtil.getTopics();
SparkConsumer consumer = KafkaUtil.getSparkConsumer(jssc, topics);
…
// Avro schema downloaded by framework here
GenericRecord genericRecord = KafkaUtil.getRecordInjections().get(topic);
jssc.start();
jssc.awaitTermination();

https://github.com/hopshadoop/hops-kafka-examples

Page 16: Spark Summit EU talk by Jim Dowling

Zeppelin Support for Spark/Livy


Page 17: Spark Summit EU talk by Jim Dowling

Livy to launch Spark 2.0 Jobs

[Image from: http://gethue.com]

Page 18: Spark Summit EU talk by Jim Dowling

Debugging Spark with DrElephant
• Project-specific view of performance/correctness issues for completed Spark Jobs
• Customizable heuristics
• Doesn’t show killed jobs

Page 19: Spark Summit EU talk by Jim Dowling

Karamel/Chef for Automated Installation

[Diagram labels: Google Compute Engine; BareMetal]

Page 20: Spark Summit EU talk by Jim Dowling


Demo

Page 21: Spark Summit EU talk by Jim Dowling

Summary
• Hopsworks provides first-class support for Spark-as-a-Service
  – Streaming or Batch Jobs
  – Zeppelin Notebooks
• Hopsworks simplifies writing secure Spark Streaming applications with Kafka

http://github.com/hopshadoop

Hops [Hadoop For Humans]

Page 22: Spark Summit EU talk by Jim Dowling

Hops Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Konstantin Popov, Antonios Kouzoupis, Ermias Gebremeskel.

Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Page 23: Spark Summit EU talk by Jim Dowling

SPARK SUMMIT EUROPE 2016

THANK YOU.
www.hops.io