Big Data Scala by the Bay: Interactive Spark in your Browser

Romain Rigaux and Erick Tryzelaar

Page 1: Big Data Scala by the Bay: Interactive Spark in your Browser

INTERACTIVE SPARK IN YOUR BROWSER

Romain Rigaux, Erick Tryzelaar

Page 2: Big Data Scala by the Bay: Interactive Spark in your Browser

GOAL OF HUE

WEB INTERFACE FOR ANALYZING DATA WITH APACHE HADOOP

SIMPLIFY AND INTEGRATE

FREE AND OPEN SOURCE

—> WEB “EXCEL” FOR HADOOP

Page 3: Big Data Scala by the Bay: Interactive Spark in your Browser

VIEW FROM 30K FEET

Hadoop Web Server

You, your colleagues and even that friend that uses IE9 ;)

Page 4: Big Data Scala by the Bay: Interactive Spark in your Browser

WHY SPARK?

SIMPLER (PYTHON, STREAMING, INTERACTIVE…)

OPENS UP DATA TO SCIENCE

SPARK —> MR

Apache Spark

Spark Streaming

MLlib (machine learning)

GraphX (graph)

Spark SQL

Page 7: Big Data Scala by the Bay: Interactive Spark in your Browser

WHY IN HUE?

MARRIED WITH FULL HADOOP ECOSYSTEM (Hive Tables, HDFS, Job Browser…)

Page 8: Big Data Scala by the Bay: Interactive Spark in your Browser

WHY IN HUE?

Multi user, YARN, Impersonation/Security

Not yet-another-app-to-install

...

Page 9: Big Data Scala by the Bay: Interactive Spark in your Browser

HISTORY V1: OOZIE

THE GOOD

• It works

THE BAD

• Submit through Oozie

• Slow

Page 10: Big Data Scala by the Bay: Interactive Spark in your Browser

HISTORY V2: SPARK IGNITER

THE GOOD

• It works better

THE BAD

• Compiler Jar

• Batch

Page 11: Big Data Scala by the Bay: Interactive Spark in your Browser

HISTORY V3: NOTEBOOK

THE GOOD

• It works even better

• Scala / Python / R shells

• Jar / Py batches

• Notebook UI

• YARN

THE BAD

• Still new

Page 12: Big Data Scala by the Bay: Interactive Spark in your Browser

GENERAL ARCHITECTURE

[Diagram: a web part and a backend part. Labels: Livy, Spark (three boxes), YARN.]

Page 14: Big Data Scala by the Bay: Interactive Spark in your Browser

WEB ARCHITECTURE

[Diagram, top to bottom:]

Notebook with snippets: Spark, Scala, Pig, Hive

AJAX to the Server: create_session(), execute(), …

Server: Common API on top of Specific APIs (Livy for Scala, …, HS2 for Hive)

Livy over REST: /session, /sessions/{sessionId}/statements

HS2 over Thrift: OpenSession(), ExecuteStatement()
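
The split between a common API and backend-specific APIs can be sketched roughly as follows. This is only an illustration in Scala: the names NotebookApi, LivyApi, HiveServer2Api and NotebookApis are hypothetical and are not taken from Hue, and each method merely returns a description of the call it would make instead of contacting a server.

```scala
// Hypothetical sketch: one common interface, one implementation per backend.
// None of these names come from Hue; nothing here talks to a real server.
trait NotebookApi {
  def createSession(): String
  def execute(sessionId: Int, code: String): String
}

// Livy is reached over REST (statement path as shown on the slide).
object LivyApi extends NotebookApi {
  def createSession(): String = "REST POST /sessions"
  def execute(sessionId: Int, code: String): String =
    s"REST POST /sessions/$sessionId/statements"
}

// HiveServer2 is reached over Thrift (call names as shown on the slide).
object HiveServer2Api extends NotebookApi {
  def createSession(): String = "Thrift OpenSession()"
  def execute(sessionId: Int, code: String): String = "Thrift ExecuteStatement()"
}

object NotebookApis {
  // The notebook picks a backend from the snippet type alone.
  def forSnippet(kind: String): Option[NotebookApi] = kind match {
    case "spark" | "scala" => Some(LivyApi)
    case "hive"            => Some(HiveServer2Api)
    case _                 => None // other snippet types are not sketched here
  }
}
```

With this shape, the notebook code only ever calls createSession() and execute(), and adding a backend means adding one more implementation.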

Page 15: Big Data Scala by the Bay: Interactive Spark in your Browser

LIVY SPARK SERVER

Page 16: Big Data Scala by the Bay: Interactive Spark in your Browser

LIVY SPARK SERVER

• REST Web server in Scala

• Interactive Spark Sessions and Batch Jobs

• Type Introspection for Visualization

• Running sessions in YARN local

• Backends: Scala, Python, R

• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java

Page 17: Big Data Scala by the Bay: Interactive Spark in your Browser

LIVY WEB SERVER ARCHITECTURE

[Diagram, built up across pages 17 to 24 with steps numbered 1 to 7. Labels: Livy Server (Scalatra, Session Manager, Session), Spark Client, YARN Master, one YARN Node running the Spark Interpreter and SparkContext, and two YARN Nodes each running a Spark Worker.]

Page 25: Big Data Scala by the Bay: Interactive Spark in your Browser

SESSION CREATION AND EXECUTION

% curl -XPOST localhost:8998/sessions \
    -d '{"kind": "spark"}'
{
  "id": 0,
  "kind": "spark",
  "log": [...],
  "state": "idle"
}

% curl -XPOST localhost:8998/sessions/0/statements -d '{"code": "1+1"}'
{
  "id": 0,
  "output": {
    "data": { "text/plain": "res0: Int = 2" },
    "execution_count": 0,
    "status": "ok"
  },
  "state": "available"
}
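
The same two calls could be built from Scala with the JDK's HTTP client (Java 11 or later). This is a minimal sketch, not part of Livy or Hue: the object name LivyRequests is hypothetical, the Content-Type header is an assumption that the curl commands do not show, and the code value is pasted into the JSON body without escaping, so it only suits simple snippets such as 1+1.

```scala
import java.net.URI
import java.net.http.HttpRequest

object LivyRequests {
  // Assumption: a Livy server on localhost:8998, as in the curl calls.
  val base = "http://localhost:8998"

  // Builds (but does not send) a POST request with a JSON body.
  private def postJson(path: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(base + path))
      .header("Content-Type", "application/json") // assumption, not shown in the curl calls
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

  // POST /sessions with {"kind": "<kind>"}
  def createSession(kind: String): HttpRequest =
    postJson("/sessions", s"""{"kind": "$kind"}""")

  // POST /sessions/<id>/statements with {"code": "<code>"} (no JSON escaping)
  def runStatement(sessionId: Int, code: String): HttpRequest =
    postJson(s"/sessions/$sessionId/statements", s"""{"code": "$code"}""")
}
```

To actually send a request, pass it to java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the JSON from the response body.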

Page 26: Big Data Scala by the Bay: Interactive Spark in your Browser

LIVY INTERPRETERS

Scala, Python, R…

Page 27: Big Data Scala by the Bay: Interactive Spark in your Browser

INTERPRETERS

• Pipe stdin/stdout to a running shell

• Execute the code / send to Spark workers

• Perform magic operations

• One interpreter per language

• “Swappable” with other kernels (python, spark…)

[Diagram: the code println(1 + 1) goes into the Interpreter, which runs "> println(1 + 1)" in its shell and returns the output 2.]

Page 28: Big Data Scala by the Bay: Interactive Spark in your Browser

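INTERPRETER FLOW

[Diagram: CURL or Hue sends 1+1 to the Livy Server, which forwards it through the Livy Session to the Interpreter. The Interpreter returns 2, and the client receives { "data": { "application/json": "2" } }.]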

Page 29: Big Data Scala by the Bay: Interactive Spark in your Browser

INTERPRETER FLOW CHART

[Flow chart: Receive lines, then Split lines, then "Lines left?". If no, Send output to server. If yes, "Magic line?": yes leads to Magic!, no leads to Execute Line. A line ends in Success, Incomplete (Merge with next line), or Error.]
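
The loop in the flow chart can be sketched as below. This is a simplified illustration rather than Livy's code: LineLoop, its Result types, and the execute and magic parameters are all hypothetical stand-ins for the real interpreter and magic handler, and a line is treated as magic simply when it starts with %.

```scala
import scala.annotation.tailrec

object LineLoop {
  sealed trait Result
  case class Success(output: String) extends Result
  case object Incomplete extends Result
  case class Failure(message: String) extends Result

  // Runs `code` line by line. Returns the collected outputs, or the first error.
  // `execute` and `magic` are hypothetical stand-ins supplied by the caller.
  def run(code: String,
          execute: String => Result,
          magic: String => Result): Either[String, List[String]] = {
    @tailrec
    def loop(lines: List[String], outputs: List[String]): Either[String, List[String]] =
      lines match {
        case Nil => Right(outputs.reverse) // no lines left: send output
        case line :: rest =>
          val result = if (line.startsWith("%")) magic(line) else execute(line)
          result match {
            case Success(out) => loop(rest, out :: outputs)
            case Failure(msg) => Left(msg)
            case Incomplete =>
              rest match {
                // merge the incomplete line with the next one and retry
                case next :: tail => loop((line + "\n" + next) :: tail, outputs)
                case Nil          => Left("incomplete statement")
              }
          }
      }
    loop(code.split("\n").toList, Nil)
  }
}
```

Merging an incomplete line with its successor is what lets a multi-line definition, sent as separate lines, reach the interpreter as one statement.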

Page 30: Big Data Scala by the Bay: Interactive Spark in your Browser

LIVY INTERPRETERS

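import scala.concurrent.Future
import org.json4s.JValue // assuming json4s provides JValue here

// Common interface implemented by each language backend.
trait Interpreter {
  def state: State
  def execute(code: String): Future[JValue]
  def close(): Unit
}

// Lifecycle states of an interpreter.
sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State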

Page 32: Big Data Scala by the Bay: Interactive Spark in your Browser

SPARK INTERPRETER

class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  // Captures everything the embedded REPL prints.
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    // The REPL writes its output to outputStream.
    sparkIMain = new SparkIMain(new Settings(), new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...

Page 33: Big Data Scala by the Bay: Interactive Spark in your Browser

SPARK INTERPRETER

private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)

  // Expose the SparkContext to user code as `sc`.
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }

  _state = Idle()
}

Page 34: Big Data Scala by the Bay: Interactive Spark in your Browser

EXECUTING SPARK

private def executeLine(code: String): ExecuteResult = {
  code match {
    // Lines such as "%json counts" go to the magic handler.
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    // Everything else is interpreted as Scala, with stdout captured.
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error => ExecuteError(readStdout())
        }
        ...

Page 35: Big Data Scala by the Bay: Interactive Spark in your Browser

INTERPRETER MAGIC

private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json" => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _ => ExecuteError(f"Unknown magic command $magic")
  }
}
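
To see what MAGIC_REGEX captures, the pattern can be tried on its own. The sketch below reuses the regular expression from the slide; the MagicDemo object and its parse helper are hypothetical names introduced only for this illustration.

```scala
object MagicDemo {
  // Same pattern as on the slide: '%', a magic name, separators, then the rest.
  private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

  // Returns Some((magic, rest)) for a magic line, None for ordinary code.
  def parse(line: String): Option[(String, String)] = line match {
    case MAGIC_REGEX(magic, rest) => Some((magic, rest))
    case _                        => None
  }
}
```

For example, MagicDemo.parse("%json counts") yields Some(("json", "counts")), while an ordinary line such as "1 + 1" yields None and would fall through to the Scala interpreter.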

Page 36: Big Data Scala by the Bay: Interactive Spark in your Browser

INTERPRETER MAGIC

private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    // For an RDD, only the first 10 elements are returned.
    case Some(value: RDD[_]) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value.asInstanceOf[RDD[_]].take(10))))

    case Some(value) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value)))

    case None =>
      ExecuteError(f"Value $name does not exist")
  }
}

Page 37: Big Data Scala by the Bay: Interactive Spark in your Browser

TABLE MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%table counts

"application/vnd.livy.table.v1+json": {
  "headers": [
    { "name": "count", "type": "BIGINT_TYPE" },
    { "name": "name", "type": "STRING_TYPE" }
  ],
  "data": [
    [ 23407, "the" ],
    [ 19540, "I" ],
    [ 18358, "and" ],
    ...
  ]
}
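
One way to picture how rows of maps could become a headers-plus-data table is the sketch below. It is not Livy's implementation: TableSketch, typeName and toTable are hypothetical names, it works on plain Scala collections instead of an RDD, and only BIGINT_TYPE and STRING_TYPE come from the slide (UNKNOWN_TYPE is a placeholder invented for this example).

```scala
object TableSketch {
  // Maps a runtime value to a column type name.
  // Only BIGINT_TYPE and STRING_TYPE appear on the slide.
  def typeName(value: Any): String = value match {
    case _: Long   => "BIGINT_TYPE"
    case _: String => "STRING_TYPE"
    case _         => "UNKNOWN_TYPE" // placeholder, not a Livy type name
  }

  // Turns rows of name -> value maps into (headers, data). The first row
  // decides the columns (sorted by name) and their types.
  def toTable(rows: Seq[Map[String, Any]]): (Seq[(String, String)], Seq[Seq[Any]]) =
    rows.headOption match {
      case None => (Seq.empty, Seq.empty)
      case Some(first) =>
        val names = first.keys.toSeq.sorted
        val headers = names.map(n => n -> typeName(first(n)))
        val data = rows.map(row => names.map(n => row.getOrElse(n, null)))
        (headers, data)
    }
}
```

Inspecting the first row to name and type the columns is the kind of type introspection that lets a notebook render a result as a table or chart without the user describing its schema.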

Page 39: Big Data Scala by the Bay: Interactive Spark in your Browser

JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%json counts

{
  "id": 0,
  "output": {
    "application/json": [
      { "count": 506610, "word": "" },
      { "count": 23407, "word": "the" },
      { "count": 19540, "word": "I" },
      ...
    ]
    ...
}

Page 41: Big Data Scala by the Bay: Interactive Spark in your Browser

COMING SOON

• Stability and Scaling

• Security

• iPython/Jupyter backends and file format

Page 42: Big Data Scala by the Bay: Interactive Spark in your Browser

DEMO TIME

Page 43: Big Data Scala by the Bay: Interactive Spark in your Browser

TWITTER

@gethue

USER GROUP

hue-user@

WEBSITE

http://gethue.com

LEARN

http://learn.gethue.com

THANKS!