Spark Summit Europe: Building a REST Job Server for interactive Spark as a service

Post on 09-Jan-2017



BUILDING A REST JOB SERVER FOR INTERACTIVE SPARK AS A SERVICE

Romain Rigaux - Cloudera
Erick Tryzelaar - Cloudera

WHY?

NOTEBOOKS

EASY ACCESS FROM ANYWHERE

SHARE SPARK CONTEXTS AND RDDs

BUILD APPS

SPARK MAGIC

WHY SPARK AS A SERVICE?

MARRIED WITH FULL HADOOP ECOSYSTEM

WHY SPARK IN HUE?

HISTORY V1: OOZIE

THE GOOD

• It works
• Code snippet

THE BAD

• Submit through Oozie
• Shell action
• Very slow
• Batch

[Diagram: workflow.xml and snippet.py submitted through Oozie, results read back from stdout]

HISTORY V2: SPARK IGNITER

THE GOOD

• It works better

THE BAD

• Compile jar
• Batch only, no shell
• No Python, R
• Security
• Single point of failure

[Diagram: Implement → Compile → Upload a Scala jar; batch execution returns JSON output; based on the Ooyala job server]

HISTORY V3: NOTEBOOK

THE GOOD

• Like spark-submit / Spark shells
• Scala / Python / R shells
• Jar / Python batch jobs
• Notebook UI
• YARN

THE BAD

• Beta?

[Diagram: code snippets and batches submitted through Livy]

GENERAL ARCHITECTURE

[Diagram: clients talk to Livy over a REST API; Livy runs the Spark applications on YARN]

LIVY SPARK SERVER

• REST web server in Scala for Spark submissions
• Interactive shell sessions or batch jobs
• Backends: Scala, Java, Python, R
• No dependency on Hue
• Open source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/

ARCHITECTURE

• Standard web service: wrapper around spark-submit / Spark shells
• YARN mode, Spark drivers run inside the cluster (resilient to crashes)
• No need to inherit any interface or compile code
• Extended to work with additional backends

LIVY WEB SERVER ARCHITECTURE

LOCAL "DEV" MODE vs YARN MODE

LOCAL MODE

[Diagram, steps 1-5: a request enters the Livy Server (Scalatra → SessionManager → Session), is handed to the SparkClient, which forwards the code to the SparkInterpreter driving a local SparkContext; the result flows back the same way]

YARN-CLUSTER MODE

PRODUCTION, SCALABLE

[Diagram, steps 1-7: the Livy Server (Scalatra → SessionManager → Session → SparkClient) submits to the YARN Master; the SparkContext and SparkInterpreter run inside a YARN node, with Spark Workers on other YARN nodes; requests flow from the Livy Server into the cluster and results flow back]

SESSION CREATION AND EXECUTION

% curl -X POST localhost:8998/sessions -d '{"kind": "spark"}'
{"id":0,"kind":"spark","log":[...],"state":"idle"}

% curl -X POST localhost:8998/sessions/0/statements -d '{"code": "1+1"}'
{"id":0,"output":{"data":{"text/plain":"res0: Int = 2"},"execution_count":0,"status":"ok"},"state":"available"}
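The two curl calls above are easy to script. A minimal sketch in Python using only the standard library; the host/port and JSON shapes come from the slides, everything else (helper name, not actually sending) is illustrative:

```python
import json
import urllib.request

LIVY = "http://localhost:8998"  # default Livy address used in the slides

def livy_request(path, payload):
    """Build a POST request with a JSON body, mirroring the curl examples."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        LIVY + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Create an interactive Scala session, then run a statement in it.
create = livy_request("/sessions", {"kind": "spark"})
run = livy_request("/sessions/0/statements", {"code": "1+1"})

# urllib.request.urlopen(create) would send it to a running Livy server;
# the Request objects can be inspected without one.
```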

BATCH OR INTERACTIVE

[Diagram: /batches takes jar and Python files; /sessions takes Scala, Python, and R snippets; both run as Spark applications on YARN through Livy]

SHELL OR BATCH?

[Three diagrams: an interactive shell session runs a SparkInterpreter in the driver; a pyspark session runs the pyspark shell; a batch runs through spark-submit. In each case the Livy Server (Scalatra → SessionManager → Session → SparkClient) talks to the YARN Master, with the SparkContext on a YARN node and Spark Workers on others]

LIVY INTERPRETERS

Scala, Python, R…

REMEMBER?

[Diagram: the same YARN-cluster architecture as before, with the SparkInterpreter inside the driver]

INTERPRETERS

• Pipe stdin/stdout to a running shell
• Execute the code / send to Spark workers
• Perform magic operations
• One interpreter per language
• "Swappable" with other kernels (python, spark…)

[Diagram: the Livy Server sends println(1+1) to the Interpreter, which replies with 2]
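Python's standard library ships the same building block, so the "feed code to a persistent shell, capture the output" idea can be shown in a few lines. This is a toy sketch, not Livy's implementation (which wraps the real Scala/PySpark/R shells):

```python
import io
import code
from contextlib import redirect_stdout

def run_snippet(interp, source):
    """Feed one snippet to a persistent interpreter and capture what it
    prints, roughly what a Livy interpreter does per statement."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        interp.runsource(source)  # 'single' mode: bare expressions echo
    return buf.getvalue().strip()

interp = code.InteractiveInterpreter()
print(run_snippet(interp, "1+1"))   # → 2
run_snippet(interp, "x = 40")       # state persists across snippets
print(run_snippet(interp, "x + 2")) # → 42
```

The interpreter object holds the session state, which is why one interpreter per language/session is enough.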

INTERPRETER FLOW

[Diagram, animated: the client types > 1+1; the Livy Server receives {"code": "1+1"} and passes 1+1 to the Interpreter; the Interpreter runs its magic handling, evaluates the code to 2, and returns it; the server replies with {"data": {"application/json": "2"}} and the client shell prints 2]

INTERPRETER FLOW CHART

[Flow chart: receive lines → split into chunks → for each chunk, magic chunk? (Yes: magic!, No: execute chunk) → on success send output to server, on error send error to server → loop while chunks are left]

Example of parsing
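The "split into chunks" step of the chart can be sketched as follows. The chunking rule (a %-prefixed line is its own magic chunk, consecutive plain lines merge into one code chunk) matches the %json/%table/%plot examples later in the deck; the function name and return shape are assumptions:

```python
def split_chunks(lines):
    """Group input lines into chunks: each '%magic' line is its own chunk,
    consecutive plain code lines merge into one code chunk."""
    chunks, current = [], []
    for line in lines:
        if line.lstrip().startswith("%"):
            if current:  # flush any pending code before the magic
                chunks.append(("code", "\n".join(current)))
                current = []
            chunks.append(("magic", line.strip()))
        else:
            current.append(line)
    if current:
        chunks.append(("code", "\n".join(current)))
    return chunks

chunks = split_chunks([
    "val counts = lines.map(...)",
    "%json counts",
])
print(chunks)
# [('code', 'val counts = lines.map(...)'), ('magic', '%json counts')]
```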

INTERPRETER MAGIC

• table
• json
• plotting
• …

NO MAGIC

> 1+1

Interpreter: sparkIMain.interpret("1+1")

{"id":0,"output":{"application/json":2}}

JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .map { case (w, c) => Map("word" -> w, "count" -> c) }

Without magic, counts prints as: [('', 506610), ('the', 23407), ('I', 19540) ...]

> %json counts

Interpreter: sparkIMain.valueOfTerm("counts").toJson()

{"id":0,"output":{"application/json":[{"count":506610,"word":""},{"count":23407,"word":"the"},{"count":19540,"word":"I"},...]...}
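The shape change the %json magic performs is simple to reproduce; a sketch that turns (word, count) pairs into the list-of-objects payload shown above (helper name is illustrative):

```python
import json

def to_json_payload(counts):
    """Mirror the Scala map { case (w, c) => Map("word" -> w, "count" -> c) }
    step: (word, count) tuples become JSON objects."""
    return [{"word": w, "count": c} for w, c in counts]

counts = [("", 506610), ("the", 23407), ("I", 19540)]
print(json.dumps({"id": 0, "output": {"application/json": to_json_payload(counts)}}))
```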

TABLE MAGIC

(same counts RDD as before)

> %table counts

Interpreter: sparkIMain.valueOfTerm("counts").guessHeaders().toList()

"application/vnd.livy.table.v1+json": {"headers":[{"name":"count","type":"BIGINT_TYPE"},{"name":"name","type":"STRING_TYPE"}],"data":[[23407,"the"],[19540,"I"],[18358,"and"],...]}
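guessHeaders() can be sketched the same way: infer each column's type from the first row and emit the application/vnd.livy.table.v1+json structure above. The type names come from the slide; the inference rule and function name are assumptions:

```python
def guess_table(rows):
    """Infer headers from the first row's keys and value types, then flatten
    the rows, mimicking valueOfTerm("counts").guessHeaders().toList()."""
    type_names = {int: "BIGINT_TYPE", str: "STRING_TYPE"}
    first = rows[0]
    headers = [
        {"name": key, "type": type_names.get(type(first[key]), "STRING_TYPE")}
        for key in sorted(first)
    ]
    data = [[row[h["name"]] for h in headers] for row in rows]
    return {"headers": headers, "data": data}

table = guess_table([
    {"count": 23407, "word": "the"},
    {"count": 19540, "word": "I"},
])
print(table["headers"])
# [{'name': 'count', 'type': 'BIGINT_TYPE'}, {'name': 'word', 'type': 'STRING_TYPE'}]
```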

PLOT MAGIC

...barplot(sorted_data$count, names.arg=sorted_data$value, main="Resource hits", las=2, col=colfunc(nrow(sorted_data)), ylim=c(0,300))

> png('/tmp/..')
> barplot
> dev.off()

Interpreter: sparkIMain.interpret("png('/tmp/plot.png') barplot dev.off()"), then File('/tmp/plot.png').read().toBase64()

{"data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAe…"...}...}
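The last step of the plot magic, reading the rendered file and shipping it base64-encoded, can be sketched as follows (the fake bytes stand in for the real PNG that png()/dev.off() would produce):

```python
import base64
import tempfile

def image_payload(path):
    """Read a rendered plot file and wrap it the way the interpreter does:
    {"data": {"image/png": "<base64>"}}."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"data": {"image/png": encoded}}

# Stand-in for the file the R interpreter would render (not a real PNG).
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG fake bytes")
    plot_path = tmp.name

payload = image_payload(plot_path)
print(payload["data"]["image/png"][:8])
```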

PLUGGABLE INTERPRETERS

• Pluggable backends
• Livy's Spark backends
  – Scala
  – pyspark
  – R
• IPython/Jupyter support coming soon

JUPYTER BACKEND

• Re-using it
• Generic framework for interpreters
• 51 kernels

SPARK AS A SERVICE

REMEMBER AGAIN?

[Diagram: the YARN-cluster architecture once more]

MULTI USERS

[Diagram: one Livy Server (Scalatra → SessionManager → Session) managing several SparkClients, each driving its own SparkContext + SparkInterpreter on a YARN node]

SHARED CONTEXTS?

[Diagram: several SparkClients pointed at a single SparkContext + SparkInterpreter on one YARN node]

SHARED RDD?

[Diagram: the shared SparkContext holding one RDD that all clients access]

SHARED RDDS?

[Diagram: the shared SparkContext holding several named RDDs that all clients access]

SECURE IT?

[Diagram: the shared-context setup again, raising the question of isolating users who share one SparkContext]

SPARK AS SERVICE

[Diagram: the Livy Server fronting Spark for several clients]

SHARING RDDS

In a PySpark shell:

r = sc.parallelize([])
srdd = ShareableRdd(r)

The RDD holds entries like {'ak': 'Alaska'} and {'ca': 'California'}.

Another shell can read it through the REST API:

curl -X POST /sessions/0/statements -d {'code': srdd.get('ak')}

Or through a Python client:

states = SharedRdd('host/sessions/0', 'srdd')
states.get('ak')
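A toy version of the ShareableRdd idea, with a plain list of key/value pairs standing in for the RDD. The real class wraps the result of sc.parallelize and is queried over REST; this sketch only shows the get/set contract:

```python
class ShareableRdd:
    """Key/value lookup over a collection, standing in for an RDD of
    {'ak': 'Alaska'}-style records that remote sessions query by key."""

    def __init__(self, records):
        self.records = records  # list of (key, value) pairs

    def set(self, key, value):
        self.records.append((key, value))

    def get(self, key):
        # A real RDD would do a filter/lookup here; a linear scan stands in.
        for k, v in self.records:
            if k == key:
                return v
        return None

srdd = ShareableRdd([])
srdd.set("ak", "Alaska")
srdd.set("ca", "California")
print(srdd.get("ak"))  # → Alaska
```

On the remote side, SharedRdd would simply POST {'code': "srdd.get('ak')"} to the owning session, as in the curl line above.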

SECURITY

• SSL support
• Persistent sessions
• Kerberos

SPARK MAGIC

• From Microsoft
• Python magics for working with remote Spark clusters
• Open source: https://github.com/jupyter-incubator/sparkmagic

FUTURE

• Move to an external repo?
• Security
• iPython/Jupyter backends and file format
• Shared named RDDs / contexts?
• Share data
• Spark specific, language generic, or both?
• Leverage Hue 4

https://issues.cloudera.org/browse/HUE-2990

LIVY'S CHEAT SHEET

• Open source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
• Scala, Java, Python, R
• Type introspection for visualization
• YARN-cluster or local modes
• Code snippets / compiled
• REST API
• Pluggable backends
• Magic keywords
• Failure resilient
• Security

BEDANKT! (THANK YOU!)

TWITTER
@gethue

USER GROUP
hue-user@

WEBSITE
http://gethue.com

LEARN
http://learn.gethue.com