Developing Apache Spark Jobs in .NET using Mobius
Kaarthik Sivashanmugam
@kaarthikss
dotnetfringe 2016
Apache Spark
• General purpose cluster computing system for big data processing and analytics
• Ease of programming
• High performance
• Unified API to solve a diverse set of complex data problems
• API in Scala, Java, Python & R
Apache Spark Key Concepts
• Data
  • RDD – Resilient Distributed Dataset
  • Transformation & Action
  • DataFrame
  • DStream
• Cluster
  • Driver
  • Executor
Mobius: C# API for Spark
• Enable organizations invested deeply in .NET to build Apache Spark applications in C#
• Reuse of existing .NET libraries in Spark applications
.NET & Spark
[Diagram: Apache Spark exposes the Scala/Java API, SparkR, PySpark, and the Mobius C# API; Spark apps in .NET are built on the Mobius C# API.]
Word Count in Spark using RDD (Scala)
[Code listing not captured in the transcript; the slide annotates each step:]
• RDD of lines in the file
• RDD of words in the file
• RDD of tuple - (word, 1)
• RDD of tuple - (word, count)
• Action that triggers job
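The slide's code listing did not survive in the transcript, so the annotated steps above can be sketched in plain Python instead. This is not the Spark or Mobius API: `flat_map`, `reduce_by_key`, and `word_count` are hypothetical helpers that only mimic the shape of the RDD pipeline (lines, words, (word, 1) pairs, (word, count) pairs) on an in-memory list.

```python
def flat_map(func, items):
    """Apply func to every item and flatten the results (mimics an RDD flatMap)."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (mimics an RDD reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    # "RDD of lines" -> "RDD of words"
    words = flat_map(str.split, lines)
    # "RDD of words" -> "RDD of tuple - (word, 1)"
    pairs = [(word, 1) for word in words]
    # "RDD of tuple - (word, 1)" -> "RDD of tuple - (word, count)"
    return reduce_by_key(lambda a, b: a + b, pairs)


if __name__ == "__main__":
    # In Spark nothing runs until an action is called; here the call itself does the work.
    print(word_count(["to be or", "not to be"]))
```

Unlike this eager sketch, Spark transformations are lazy: the job only runs when the final action is invoked, which is what the last annotation on the slide points out.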
Word Count in Spark using RDD: C#, Scala, F#
[Side-by-side code listings not captured in the transcript.]
Develop & Launch Mobius Applications
1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver
Spark Client
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh
Runs Spark job
Example: sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://nn/path/to/eventlog --exe Pi.exe D:\Mobius\examples\Pi
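When the submit command is generated by a script rather than typed by hand, the argument list can be assembled programmatically. The following is a minimal sketch that uses only the flags visible in the example above (`--master`, `--total-executor-cores`, `--executor-memory`, `--conf`, `--exe`); the helper name `build_submit_args` and its parameters are hypothetical, and the sketch only builds the list without launching anything.

```python
def build_submit_args(master, exe, app_dir, total_executor_cores=None,
                      executor_memory=None, conf=None,
                      script="sparkclr-submit.cmd"):
    """Return the argument list for a sparkclr-submit invocation (hypothetical helper)."""
    args = [script, "--master", master]
    if total_executor_cores is not None:
        args += ["--total-executor-cores", str(total_executor_cores)]
    if executor_memory is not None:
        args += ["--executor-memory", executor_memory]
    # Each Spark setting becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    # The driver executable and the directory that contains it come last.
    args += ["--exe", exe, app_dir]
    return args


if __name__ == "__main__":
    print(build_submit_args(
        master="spark://IP:PORT",
        exe="Pi.exe",
        app_dir=r"D:\Mobius\examples\Pi",
        total_executor_cores=200,
        executor_memory="12g",
        conf={"spark.eventLog.enabled": "true",
              "spark.eventLog.dir": "hdfs://nn/path/to/eventlog"},
    ))
```

Keeping the arguments as a list (rather than one concatenated string) avoids the spacing mistakes that are easy to make around `--conf` when the command is pasted together by hand.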
Demo: Implementing a simple Mobius driver program using DataFrame
Structured Data in Mobius using DataFrame
Examples shown: JSON, Cassandra [code listings not captured in the transcript]
Note – Dataset is replacing DataFrame in Spark
Mobius & Spark
[Architecture diagram]
• Driver: C# Driver (CLR) connected over IPC sockets to SparkContext (JVM)
• Workers: each Spark Executor (JVM) connected over IPC sockets to a C# Worker (CLR); three executor/worker pairs are shown
Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux
Mobius in Linux
• Mono is used to run Mobius with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Mobius validated in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
Kafka Message Processing in Mobius using DStream
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
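The slide's code was not captured, but the central step (counting logs by log level within a time window) can be illustrated without Spark. The sketch below is plain Python, not the Mobius or Spark Streaming API: `windowed_level_counts` is a hypothetical helper, each inner list stands in for one micro-batch of messages, and the log level is assumed to be the first space-separated token of each line.

```python
from collections import Counter, deque


def windowed_level_counts(batches, window_batches):
    """Yield, after each batch, log-level counts over the last `window_batches` batches."""
    # The deque drops the oldest batch automatically once the window is full.
    window = deque(maxlen=window_batches)
    for batch in batches:
        # Assumption: the level is the first token, e.g. "ERROR disk full".
        window.append(Counter(line.split(" ", 1)[0] for line in batch if line))
        total = Counter()
        for batch_counts in window:
            total.update(batch_counts)
        yield dict(total)


if __name__ == "__main__":
    batches = [
        ["INFO started", "ERROR disk full"],
        ["INFO heartbeat"],
        ["WARN slow response"],
    ]
    for counts in windowed_level_counts(batches, window_batches=2):
        print(counts)
```

In a real streaming job the window is expressed in time units and the runtime handles batching, checkpointing, and recovery; the sketch only shows how a sliding window combines per-batch counts.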
Internals of Driver & Worker
Driver-side Interop
[Diagram, reconstructed as steps]
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
Interop Components
• C# side (CLR): SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, … are proxies for JVM objects; they create and manage those objects and invoke JVM methods
• JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD mirror C#-side operations
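The proxy idea on this slide (driver-side objects that forward every call to a real object living in another runtime) can be sketched generically. The code below is an in-process simulation under stated assumptions, not Mobius's actual protocol: `Backend`, `Proxy`, and `FakeConf` are hypothetical names, JSON stands in for the wire format, and a direct method call stands in for the socket connection to the Netty server.

```python
import json


class Backend:
    """Stands in for the JVM-side server: owns the real objects, addressed by id."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj):
        self._next_id += 1
        object_id = str(self._next_id)
        self._objects[object_id] = obj
        return object_id

    def handle(self, request_bytes):
        # Decode the request, invoke the named method on the real object, encode the result.
        request = json.loads(request_bytes)
        target = self._objects[request["id"]]
        result = getattr(target, request["method"])(*request["args"])
        return json.dumps({"result": result}).encode()


class Proxy:
    """Driver-side stand-in: holds only an object id and forwards every method call."""

    def __init__(self, backend, object_id):
        self._backend = backend
        self._object_id = object_id

    def __getattr__(self, method):
        # Called only for names not found on the proxy itself, i.e. remote methods.
        def call(*args):
            request = json.dumps(
                {"id": self._object_id, "method": method, "args": list(args)}
            ).encode()
            return json.loads(self._backend.handle(request))["result"]
        return call


class FakeConf:
    """Hypothetical backend-side object, loosely shaped like a configuration holder."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return True

    def get(self, key):
        return self._settings[key]


if __name__ == "__main__":
    backend = Backend()
    conf = Proxy(backend, backend.register(FakeConf()))
    conf.set("spark.app.name", "demo")
    print(conf.get("spark.app.name"))
```

The design point the sketch tries to capture is that the driver-side object carries no state of its own beyond a reference; all state and behavior stay on the backend side.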
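Worker-side Interop
[Diagram, reconstructed as steps: a Spark Worker hosts the Executor with CSharpRDD in the JVM and CSharpWorker.exe in the CLR]
1. Compute (CSharpRDD in the Executor)
2. Launch CSharpWorker.exe
3. Read bytes
4. Execute C# operation
5. Write bytes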
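Steps 3 to 5 of the worker (read bytes, execute the operation, write bytes) can be sketched as a small loop over a byte stream. This is an illustration under assumptions rather than Mobius's wire format: the 4-byte big-endian length prefix and the helper names `read_frame`, `write_frame`, and `run_worker` are hypothetical, and in-memory streams stand in for the IPC socket.

```python
import io
import struct


def write_frame(stream, payload):
    """Write one record as a 4-byte big-endian length followed by the payload."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one length-prefixed record; return None when the stream is exhausted."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">i", header)
    return stream.read(length)


def run_worker(inp, out, operation):
    """Read bytes, execute the operation on each record, write the result bytes."""
    while True:
        record = read_frame(inp)
        if record is None:
            break
        write_frame(out, operation(record))


if __name__ == "__main__":
    inp, out = io.BytesIO(), io.BytesIO()
    for record in (b"spark", b"mobius"):
        write_frame(inp, record)
    inp.seek(0)
    run_worker(inp, out, bytes.upper)
    out.seek(0)
    print(read_frame(out), read_frame(out))
```

Because the worker only sees opaque bytes in and bytes out, the JVM side does not need to understand the user's operation; it just ships serialized records across the process boundary and collects the results.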
Mobius Project Info
• https://github.com/Microsoft/Mobius
• MIT license
• Discussions
  • StackOverflow: tag “SparkCLR”
  • Gitter: https://gitter.im/Microsoft/Mobius
  • Twitter: @MobiusForSpark
Mobius Project Status
• Past Releases
  • v1.5.200 (Spark 1.5.2)
  • v1.6.100 (Spark 1.6.1)
• Upcoming Releases
  • v1.6.200 (Spark 1.6.2)
  • v2.0.000 (Spark 2.0.0)
• Work planned/in progress
  • Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?)
  • Exploration of support for ML scenarios
  • Idiomatic F# API (?)
  • Support for .NET Core
Thank you
Mobius is production-ready & cloud-ready
Use Mobius to build Apache Spark jobs in .NET
Contribute to github.com/Microsoft/Mobius
@MobiusForSpark