Developing Apache Spark Jobs in .NET using Mobius

Developing Apache Spark Jobs in .NET using Mobius, Kaarthik Sivashanmugam, @kaarthikss, dotnetfringe 2016

Transcript of Developing apache spark jobs in .net using mobius

Page 1: Developing apache spark jobs in .net using mobius

Developing Apache Spark Jobs in .NET using Mobius

Kaarthik Sivashanmugam

@kaarthikss

dotnetfringe 2016

Page 2: Developing apache spark jobs in .net using mobius

Apache Spark

• General purpose cluster computing system for big data processing and analytics
• Ease of programming
• High performance
• Unified API to solve a diverse set of complex data problems
• API in Scala, Java, Python & R

Page 3: Developing apache spark jobs in .net using mobius

Apache Spark Key Concepts

• Data
  • RDD – Resilient Distributed Dataset
  • Transformation & Action
  • DataFrame
  • DStream
• Cluster
  • Driver
  • Executor

Page 4: Developing apache spark jobs in .net using mobius

Mobius: C# API for Spark

• Enable organizations invested deeply in .NET to build Apache Spark applications in C#
• Reuse of existing .NET libraries in Spark applications

Page 5: Developing apache spark jobs in .net using mobius

.NET & Spark

[Diagram: Apache Spark at the base, exposing the Scala/Java API, SparkR, PySpark, and the Mobius C# API. Spark apps in .NET are built on the Mobius C# API.]

Page 6: Developing apache spark jobs in .net using mobius

Word Count in Spark using RDD (Scala)

Annotations on the code, in pipeline order:

• RDD of lines in the file
• RDD of words in the file
• RDD of tuple - (word, 1)
• RDD of tuple - (word, count)
• Action that triggers job
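The annotated stages above can be modelled without Spark to show the shape of the data at each step and why only the final action does any work. This is a minimal plain-Python sketch, not Spark or Mobius code: `LocalDataset` and `word_count` are hypothetical names made up for illustration, and everything runs in a single process.

```python
from typing import Callable, Dict, Iterable, List


class LocalDataset:
    """Toy stand-in for an RDD (hypothetical, single process).

    Transformations only record what to do; nothing is computed
    until an action such as collect() is called.
    """

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source  # zero-argument function that yields the data

    def flat_map(self, fn: Callable) -> "LocalDataset":
        return LocalDataset(lambda: (y for x in self._source() for y in fn(x)))

    def map(self, fn: Callable) -> "LocalDataset":
        return LocalDataset(lambda: (fn(x) for x in self._source()))

    def reduce_by_key(self, fn: Callable) -> "LocalDataset":
        def run():
            totals = {}
            for key, value in self._source():
                totals[key] = fn(totals[key], value) if key in totals else value
            return iter(totals.items())

        return LocalDataset(run)

    def collect(self) -> List:
        """Action: runs the whole recorded pipeline and returns the results."""
        return list(self._source())


def word_count(text_lines: List[str]) -> Dict[str, int]:
    lines = LocalDataset(lambda: text_lines)                  # lines in the file
    words = lines.flat_map(lambda line: line.split())         # words in the file
    pairs = words.map(lambda word: (word, 1))                 # (word, 1)
    counts = pairs.reduce_by_key(lambda a, b: a + b)          # (word, count)
    return dict(counts.collect())                             # action triggers the work


if __name__ == "__main__":
    print(word_count(["to be or not", "to be"]))
```

In real Spark the same chain is distributed across executors, and `reduceByKey` involves a shuffle. The sketch only mirrors the order of the stages and the deferred execution.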

Page 7: Developing apache spark jobs in .net using mobius

Word Count in Spark using RDD

C#

Scala

F#

Page 8: Developing apache spark jobs in .net using mobius

Develop & Launch Mobius Applications

1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver

Spark Client

A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)

Example: sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://nn/path/to/eventlog --exe Pi.exe D:\Mobius\examples\Pi
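When the submit step is scripted, it can help to assemble the argument list programmatically rather than by string concatenation. The sketch below is a hypothetical helper (`build_submit_command` is not part of Mobius) that uses only the options visible in the example above. It builds the argument list and does not launch anything.

```python
from typing import Dict, List, Optional


def build_submit_command(
    master: str,
    exe: str,
    app_dir: str,
    total_executor_cores: Optional[int] = None,
    executor_memory: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
    script: str = "sparkclr-submit.cmd",
) -> List[str]:
    """Build an argument list shaped like the slide's example command.

    Only the options shown on the slide are handled; anything else
    the script may accept is outside this sketch.
    """
    cmd = [script, "--master", master]
    if total_executor_cores is not None:
        cmd += ["--total-executor-cores", str(total_executor_cores)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The driver executable name and the directory that contains it come last.
    cmd += ["--exe", exe, app_dir]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_submit_command(
        master="spark://IP:PORT",
        exe="Pi.exe",
        app_dir=r"D:\Mobius\examples\Pi",
        total_executor_cores=200,
        executor_memory="12g",
        conf={"spark.eventLog.enabled": "true"},
    )))
```

Keeping the arguments as a list means the result can be passed to `subprocess.run` without shell quoting concerns.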

Page 9: Developing apache spark jobs in .net using mobius

Demo

Implementing a simple Mobius driver program using DataFrame

Page 10: Developing apache spark jobs in .net using mobius

Structured Data in Mobius using DataFrame

JSON Cassandra

Note – Dataset is replacing DataFrame in Spark (as of Spark 2.0, DataFrame in the Scala API is an alias for Dataset[Row])

Page 11: Developing apache spark jobs in .net using mobius

Mobius & Spark

[Diagram: On the driver, the C# Driver (CLR) talks to the SparkContext (JVM) over IPC sockets. On each of three workers, a C# Worker (CLR) talks to the Spark Executor (JVM) over IPC sockets.]

Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux

Page 12: Developing apache spark jobs in .net using mobius

Mobius in Linux

• Mono is used to run Mobius with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Mobius validated in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub

Page 13: Developing apache spark jobs in .net using mobius

Kafka Message Processing in Mobius using DStream

Initialize StreamingContext & Checkpoint

Create Kafka DStream

Use DStream transformations to count logs by loglevel within a time window

Save log count

Start stream processing
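The core of this pipeline, counting log lines by level over a sliding window of micro-batches, can be modelled in plain Python. This is a sketch under stated assumptions rather than DStream or Mobius code: the log format (`<LEVEL> <message>`), the function names, and expressing the window in batches instead of seconds are all choices made for illustration. Kafka input, checkpointing, and saving are left out.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterable, Iterator, List


def log_level(line: str) -> str:
    """Assumed log format: '<LEVEL> <message>'. The level is the first token."""
    parts = line.split(maxsplit=1)
    return parts[0] if parts else "UNKNOWN"


def windowed_level_counts(
    batches: Iterable[List[str]],
    window_batches: int,
    slide_batches: int = 1,
) -> Iterator[Dict[str, int]]:
    """Yield per-level counts over the last `window_batches` micro-batches,
    emitting one result every `slide_batches` batches."""
    window: Deque[Counter] = deque(maxlen=window_batches)  # old batches fall off
    for index, batch in enumerate(batches, start=1):
        window.append(Counter(log_level(line) for line in batch))
        if index % slide_batches == 0:
            total: Counter = Counter()
            for counts in window:
                total.update(counts)
            yield dict(total)


if __name__ == "__main__":
    stream = [
        ["INFO started", "ERROR disk full"],
        ["INFO heartbeat"],
        ["WARN slow response", "INFO heartbeat"],
    ]
    for result in windowed_level_counts(stream, window_batches=2):
        print(result)
```

Each yielded dictionary corresponds to one window's output, which a real job would then save before the next window fires.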

Page 14: Developing apache spark jobs in .net using mobius

Internals of Driver & Worker

Page 15: Developing apache spark jobs in .net using mobius

Driver-side Interop

1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh launches CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend

[Diagram: On the CLR side, the C# driver creates and manages proxies for JVM objects (SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, …). Interop components invoke JVM methods. On the JVM side, the corresponding objects (SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD) mirror C#-side operations.]
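The proxy idea on this slide, where one process holds the real objects and another calls their methods through a socket using a port handed over at launch, can be illustrated with a small self-contained model. Everything below is an assumption made for illustration: the JSON-lines wire format, the class names `Backend`, `ObjectProxy`, and `FakeConf`, and the use of a thread in place of a separate JVM process. It does not reflect the actual protocol between CSharpBackend and the C# driver.

```python
import json
import socket
import threading


class Backend:
    """Plays the backend role: owns the real objects and executes
    method calls requested by the other side."""

    def __init__(self):
        self._objects = {}
        self._server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._server.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
        self._server.listen(1)
        self.port = self._server.getsockname()[1]  # handed to the driver side

    def register(self, object_id, obj):
        self._objects[object_id] = obj

    def serve_one_client(self):
        conn, _ = self._server.accept()
        with conn, conn.makefile("r", encoding="utf-8") as reader, \
                conn.makefile("w", encoding="utf-8") as writer:
            while True:
                line = reader.readline()
                if not line:  # client closed the connection
                    break
                request = json.loads(line)
                target = self._objects[request["id"]]
                result = getattr(target, request["method"])(*request["args"])
                writer.write(json.dumps({"result": result}) + "\n")
                writer.flush()
        self._server.close()


class ObjectProxy:
    """Driver-side stand-in: every method call is forwarded to the backend."""

    def __init__(self, port, object_id):
        self._object_id = object_id
        self._conn = socket.create_connection(("127.0.0.1", port))
        self._reader = self._conn.makefile("r", encoding="utf-8")
        self._writer = self._conn.makefile("w", encoding="utf-8")

    def __getattr__(self, method):
        if method.startswith("_"):
            raise AttributeError(method)

        def call(*args):
            request = {"id": self._object_id, "method": method, "args": list(args)}
            self._writer.write(json.dumps(request) + "\n")
            self._writer.flush()
            return json.loads(self._reader.readline())["result"]

        return call

    def close(self):
        self._reader.close()
        self._writer.close()
        self._conn.close()


class FakeConf:
    """Hypothetical backend-side object standing in for a configuration."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return True

    def get(self, key):
        return self._settings[key]


def demo():
    backend = Backend()
    backend.register("conf-1", FakeConf())
    server = threading.Thread(target=backend.serve_one_client, daemon=True)
    server.start()
    conf = ObjectProxy(backend.port, "conf-1")  # connects using the backend's port
    conf.set("spark.app.name", "demo")
    name = conf.get("spark.app.name")
    conf.close()
    server.join(timeout=5)
    return name


if __name__ == "__main__":
    print(demo())
```

The proxy holds no state of its own beyond an object identifier, which is the property that lets the driver-side classes stay thin wrappers.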

Page 16: Developing apache spark jobs in .net using mobius

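Worker-side Interop

[Diagram: Inside a Spark Worker, the Executor (JVM) holds the CSharpRDD and communicates with CSharpWorker.exe (CLR).]

1. Compute (CSharpRDD in the Executor)
2. Launch CSharpWorker.exe
3. Read bytes
4. Execute C# operation
5. Write bytes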
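The read bytes, execute, write bytes loop can be sketched as a tiny framed exchange between two endpoints. The model below makes several assumptions for illustration: a 4-byte big-endian length prefix per record, a negative length as the end marker, a thread standing in for the worker process, and one record in flight at a time. None of this is taken from the real CSharpWorker protocol.

```python
import socket
import struct
import threading
from typing import Callable, List, Optional

END_OF_DATA = -1  # assumed end marker: a frame length of -1


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Write one record as <4-byte length><payload>."""
    sock.sendall(struct.pack(">i", len(payload)) + payload)


def send_end(sock: socket.socket) -> None:
    sock.sendall(struct.pack(">i", END_OF_DATA))


def recv_exact(sock: socket.socket, size: int) -> bytes:
    data = bytearray()
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise EOFError("peer closed the connection mid-frame")
        data.extend(chunk)
    return bytes(data)


def recv_frame(sock: socket.socket) -> Optional[bytes]:
    """Read one record, or return None when the end marker arrives."""
    (length,) = struct.unpack(">i", recv_exact(sock, 4))
    return None if length < 0 else recv_exact(sock, length)


def worker_loop(sock: socket.socket, operation: Callable[[bytes], bytes]) -> None:
    """Worker side: read bytes, execute the operation, write bytes back."""
    while True:
        record = recv_frame(sock)
        if record is None:
            send_end(sock)  # acknowledge the end of the partition
            return
        send_frame(sock, operation(record))


def compute_partition(records: List[bytes],
                      operation: Callable[[bytes], bytes]) -> List[bytes]:
    """Executor side: launch the worker, stream records to it, collect results."""
    executor_side, worker_side = socket.socketpair()
    worker = threading.Thread(target=worker_loop,
                              args=(worker_side, operation), daemon=True)
    worker.start()
    results = []
    try:
        for record in records:
            send_frame(executor_side, record)
            results.append(recv_frame(executor_side))
        send_end(executor_side)
        recv_frame(executor_side)  # wait for the worker's end marker
    finally:
        executor_side.close()
        worker.join(timeout=5)
        worker_side.close()
    return results


if __name__ == "__main__":
    print(compute_partition([b"spark", b"mobius"], bytes.upper))
```

Sending one record and waiting for its result keeps the sketch free of buffer deadlocks. A production design would more likely stream records in bulk.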

Page 17: Developing apache spark jobs in .net using mobius

Mobius Project Info

• https://github.com/Microsoft/Mobius
• MIT license
• Discussions
  • StackOverflow: tag “SparkCLR”
  • Gitter: https://gitter.im/Microsoft/Mobius
  • Twitter: @MobiusForSpark

Page 18: Developing apache spark jobs in .net using mobius

Mobius Project Status

• Past Releases
  • v1.5.200 (Spark 1.5.2)
  • v1.6.100 (Spark 1.6.1)
• Upcoming Releases
  • v1.6.200 (Spark 1.6.2)
  • v2.0.000 (Spark 2.0.0)
• Work planned/in progress
  • Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?)
  • Exploration of support for ML scenarios
  • Idiomatic F# API (?)
  • Support for .NET Core

Page 19: Developing apache spark jobs in .net using mobius

Thank you

Mobius is production-ready & cloud-ready
Use Mobius to build Apache Spark jobs in .NET
Contribute to github.com/Microsoft/Mobius
@MobiusForSpark