1. Spark Kernel (IBM Emerging Internet Technologies)
2. Outline
   - Scenario
   - Problem: how do you enable interactive applications against Apache Spark?
   - Solution: Spark Kernel
   - Architecture
   - Memory issue
   - Comm API
   - Livesheets (line-of-business tool)
   - RESTful server (query interface)
   - Extending the Spark Kernel
   - Summary & Questions
3. Scenario
   - Livesheets prototype
   - Needs to build computations on the fly
   - Needs to perform computations on static (historical) data as well as dynamic (streaming) data
   - Needs to be responsive (on the order of seconds instead of minutes)
4. Problem: how do you enable interactive applications?
   Existing options:
   - Spark Submit for job submission to Apache Spark
   - JDBC and other offerings available for Spark SQL
   - RESTful interfaces available for submitting jars
   - Spark Shell, which offers code-snippet support for executing against a Spark cluster
5. Problem: how do you enable interactive applications?
   Our first approach used Spark Submit:
   - Bundled Spark-based computations into a jar
   - Started an external process to run the Spark Submit script against the jar (a minimal sketch follows)
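A minimal sketch of that external-process approach (the job class, master URL, and jar path below are illustrative assumptions, not details from the slides):

    import scala.sys.process._

    // Each changed computation meant rebuilding the jar, then launching
    // spark-submit as a separate OS process and waiting for it to exit.
    object SubmitJob {
      def main(args: Array[String]): Unit = {
        val exitCode = Seq(
          "spark-submit",
          "--class", "com.example.Computation", // hypothetical job class
          "--master", "spark://master:7077",    // hypothetical cluster master
          "/path/to/computation.jar"            // rebuilt on every change
        ).!                                     // blocks until the job finishes

        // The job itself had to write results to a data store, which the
        // application then read back out.
        println(s"spark-submit exited with code $exitCode")
      }
    }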
6. Problem: how do you enable interactive applications? What was wrong?
   - Rebundle the jar every time a computation changed
   - Not easy to attach to an existing Spark job
   - Getting results involved writing to a data store and then reading back out
   - Very slow turnaround
7. Solution: Spark Kernel
   A Scala application that can:
   - Define and execute raw Scala source code
   - Define and run Spark tasks via code snippets or jars
   - Collect results directly from a Spark cluster
   Benefits:
   - Avoids the friction of shipping jars and reading results from peripheral systems
   - Well-defined API (IPython/Jupyter)
   - Acts as a proxy for Spark applications so that they can run remotely, away from the Spark cluster
   - Provides a client library for application development
   [Diagram: IPython and applications using the kernel client library connect to the kernel over ZeroMQ with the IPython message protocol; the kernel runs against a Spark cluster of one master and several workers]
8. Kernel Architecture
   [Diagram: the five ZeroMQ channels (Heartbeat, Shell, Control, StdIn, IOPub) feed an Akka layer responsible for message parsing and validation, routing, and message handling, which drives the Scala interpreter, class server, and Spark context connected to the Spark cluster]
9. Kernel Architecture
   [Same architecture diagram as slide 8]
   Why ZeroMQ?
   - Used by IPython
   - Responsiveness
   - Building blocks have behavior: a publisher, for example, sends messages to all subscribers
10. Kernel Architecture
    [Same architecture diagram as slide 8]
    Why Akka?
    - Concurrency
    - Code isolation
    - Fault tolerance
    - Scalability
11. IPython Protocol
    - Specifies the incoming and outgoing messages handled by the kernel
    - Defines the purposes of the five channels of communication: Heartbeat, Shell, Control, StdIn, and IOPub
    - Uses ZeroMQ for socket communication over the five defined ports
    - Uses ZMTP as the wire protocol
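Messages on these channels follow the IPython wire format: a header, parent header, metadata, and content section, plus an HMAC signature (see slide 15). The abridged execute_request below is an illustration, not content from the slides:

    {
      "header": {
        "msg_id": "u-u-i-d",
        "session": "u-u-i-d",
        "username": "user",
        "msg_type": "execute_request"
      },
      "parent_header": {},
      "metadata": {},
      "content": {
        "code": "sc.parallelize(1 to 100).sum()",
        "silent": false
      }
    }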
12. Heartbeat
    - Used to indicate that the kernel is still alive
    - Echoes received messages back to the client
    - Primarily used by IPython
    Shell
    - Used to communicate requests from a client to the kernel
    - Main purposes are code execution and Comm messages from a client
13. Control
    - Serves as a higher-priority Shell channel
    - Typically used to receive shutdown signals
    StdIn
    - Used to communicate requests from the kernel to the client(s)
    - Primarily used by IPython as a form of communication with users through the UI
14. IOPub
    - Broadcasts messages to all listening clients
    - Used to communicate side effects (standard out/error) as well as Comm messages
15. Processing Messages: Parsing and Validation
    - Uses Akka actors wrapping JeroMQ as an abstraction to parse messages
    - Calculates an HMAC (keyed-hash message authentication code) using SHA-256 and a secret key to validate against the signature in a message
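A minimal sketch of that validation step, assuming the digest is computed over the message's serialized frames and hex-encoded for comparison (illustrative code, not the kernel's implementation):

    import javax.crypto.Mac
    import javax.crypto.spec.SecretKeySpec

    object MessageValidator {
      // Compute an HMAC-SHA256 digest over the message frames using the
      // shared secret key, returned hex-encoded.
      def hmacSha256(secretKey: String, frames: Seq[String]): String = {
        val mac = Mac.getInstance("HmacSHA256")
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA256"))
        frames.foreach(frame => mac.update(frame.getBytes("UTF-8")))
        mac.doFinal().map("%02x".format(_)).mkString
      }

      // A message is accepted only if the recomputed digest matches the
      // signature carried in the message.
      def isValid(secretKey: String, frames: Seq[String], signature: String): Boolean =
        hmacSha256(secretKey, frames) == signature
    }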
16. Processing Messages: Routing
    - Incoming messages are routed by message type to the associated message handler actors
    - Outgoing messages are routed by message type to the associated channels
17. Processing Messages: Message Handling
    - Each message type has an associated Akka actor to handle the request
    - Some handlers use child actors to perform tasks, protecting the handler's state by following Erlang's Error Kernel pattern as well as reducing strain on the handler
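A self-contained sketch of this routing-plus-handling arrangement (the actor and message names are illustrative, not the kernel's actual classes):

    import akka.actor.{Actor, ActorRef, ActorSystem, Props}

    // Simplified stand-in for a parsed protocol message.
    case class KernelMessage(msgType: String, content: String)

    // One handler actor per message type.
    class ExecuteRequestHandler extends Actor {
      def receive = {
        case KernelMessage(_, code) =>
          // A real handler might delegate risky work to a child actor so a
          // failure cannot corrupt this actor's state (Error Kernel pattern).
          println(s"executing: $code")
      }
    }

    // Routes each incoming message to the handler registered for its type.
    class Router(handlers: Map[String, ActorRef]) extends Actor {
      def receive = {
        case msg: KernelMessage =>
          handlers.get(msg.msgType).foreach(_ forward msg)
      }
    }

    object RoutingExample extends App {
      val system = ActorSystem("kernel")
      val execHandler = system.actorOf(Props[ExecuteRequestHandler], "execute-handler")
      val router = system.actorOf(Props(new Router(Map("execute_request" -> execHandler))))
      router ! KernelMessage("execute_request", "sc.parallelize(1 to 10).sum()")
    }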
18. Scala Interpreter
    - Uses the Spark REPL API to execute Scala code
    - Contains zero modifications to Spark's REPL
    - Injects variables to provide the Spark APIs and kernel APIs, including magics and Comm communication
19. Class Server
    - Exposes generated REPL classes to the Spark cluster
    - In Spark's Scala 2.10 implementation of the REPL, this is created for us
    Spark Context
    - A standard Scala-based SparkContext
    - Exposed as a variable named sc to user-submitted code
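As an illustration (not from the slides), a snippet a user might submit, relying on the injected sc variable:

    // `sc` is already defined by the kernel; no SparkContext setup is needed.
    val rdd = sc.parallelize(1 to 1000)
    val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
    println(sumOfSquares) // the result is collected directly from the cluster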
20. Kernel Client Architecture
    [Diagram: the client mirrors the kernel's channels (Heartbeat, Shell, Control, StdIn, IOPub), with its own Akka layer for message parsing and validation, routing, and message handling beneath the API and the application]
    - Exposes public methods accessible from Scala and Java
    - Client sockets mirror and communicate with the kernel's sockets
    - The client's actor system shares its codebase with the kernel
21. Kernel Client Example
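The code from this slide is not preserved in the transcript. The sketch below is hypothetical: the KernelClient trait is an illustrative stand-in, not the Spark Kernel client library's actual API.

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    // Hypothetical client interface: submit a snippet, get its result back
    // asynchronously over the mirrored ZeroMQ channels.
    trait KernelClient {
      def execute(code: String): Future[String]
      def close(): Unit
    }

    object ClientExample {
      def run(client: KernelClient): Unit = {
        client.execute("val rdd = sc.parallelize(1 to 100)")
        client.execute("rdd.sum()").foreach { result =>
          println(s"sum = $result")
          client.close() // done with this interactive session
        }
      }
    }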
22. Memory Issue
    The Scala REPL (and therefore the Spark Shell):
    - Generates new classes with each code snippet compiled (leads to PermGen space issues on the JVM)
    - Instantiates a new Request class instance per execution to hold state (leads to OutOfMemory exceptions)
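To make the first point concrete, a small sketch against the Scala 2 REPL API (IMain); illustrative only:

    import scala.tools.nsc.Settings
    import scala.tools.nsc.interpreter.IMain

    object ReplMemoryDemo extends App {
      val settings = new Settings
      settings.usejavacp.value = true // interpret against the JVM classpath
      val interpreter = new IMain(settings)

      // Each interpreted snippet is compiled into freshly generated wrapper
      // classes plus a request object holding its state; a long-lived
      // interactive session therefore accumulates classes and state.
      (1 to 3).foreach(i => interpreter.interpret(s"val x$i = $i * $i"))
      interpreter.close()
    }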
23. Memory Issue
    Comm API to the rescue!
24. Comm API
    - Flexibility: bidirectional communication; ability to programmatically define messages and their interactions
    - Performance: avoids recompiling code; does not keep execution state
    - Simplicity: start (open) communication, send data (msg), stop (close) communication
    [Diagram: frontend (client) and backend (kernel) exchange open, msg, and close messages]
25. Comm Open Request
    - Establishes a new link between the frontend (client) and backend (kernel)
    - Can contain data needed for initialization
    { "comm_id" : "u-u-i-d", "target_name" : "my_comm", "data" : {} }
26. Comm Msg Request
    - Primary form of communication
    - Contains data relevant to the request
    { "comm_id" : "u-u-i-d", "data" : {} }
27. Comm Close Request
    - Removes the link between the frontend and backend components
    - Can contain data needed for teardown
    { "comm_id" : "u-u-i-d", "data" : {} }
28. Livesheets
29. RESTful Server
30. Extending the Spark Kernel
    - PySpark support
    - Zeppelin integration
31. Summary
    - Goal was to provide an API that enables interactive Spark applications
    - The kernel provides a responsive API for using Apache Spark
    - Submit code snippets in the same fashion as the Spark Shell
    - Use the Comm API for programmatically defined messages
    - The kernel implements the IPython message protocol
    - Usable with IPython notebooks out of the box
    - Repository: https://github.com/ibm-et/spark-kernel