Spark Summit EU talk by Kaarthik Sivashanmugam


Transcript of Spark Summit EU talk by Kaarthik Sivashanmugam

Page 1: Spark Summit EU talk by Kaarthik Sivashanmugam

Build Your Next Apache Spark Job in .NET Using Mobius

Kaarthik Sivashanmugam @kaarthikss

Page 2: Spark Summit EU talk by Kaarthik Sivashanmugam

Mobius

C# API for building Apache Spark applications in .NET

Page 3: Spark Summit EU talk by Kaarthik Sivashanmugam

Motivation
• Enable organizations deeply invested in .NET to build Apache Spark applications in C#
• Reuse existing .NET libraries in Spark applications

Page 4: Spark Summit EU talk by Kaarthik Sivashanmugam

Yet Another Language Binding?

Popularity of C#
• StackOverflow.com Developer Survey
• RedMonk Programming Language Rankings

.NET ecosystem ~ enabling languages like F#

Spark Survey Results

[Charts from the 2015 and 2016 Spark surveys: "Most important aspects of Spark" and "Fastest growing areas from 2014 to 2015"]

Page 5: Spark Summit EU talk by Kaarthik Sivashanmugam

Mobius & Spark

[Diagram: Apache Spark at the base, with the Scala/Java API, SparkR, PySpark, and the Mobius C# API on top of it; Spark apps in C# are written against the C# API]

Page 6: Spark Summit EU talk by Kaarthik Sivashanmugam

Word Count

[Code samples: word count in C#, Scala, and F#]
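The word count pipeline has the same shape in all three languages: split lines into words, pair each word with 1, and sum the pairs by word. Below is a minimal sketch of that shape in plain C# with LINQ, running on an in-memory array rather than on Spark. It does not use the Mobius API, and the names `WordCountSketch` and `CountWords` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Hypothetical helper: counts words in a set of lines using LINQ only.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        return lines
            // "flat map": split every line into words
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            // "map": pair each word with a count of 1
            .Select(word => new KeyValuePair<string, int>(word, 1))
            // "reduce by key": sum the counts of identical words
            .GroupBy(pair => pair.Key)
            .ToDictionary(group => group.Key, group => group.Sum(pair => pair.Value));
    }

    public static void Main()
    {
        var lines = new[] { "the quick brown fox", "the lazy dog", "the end" };
        foreach (var entry in CountWords(lines))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

On a cluster the three LINQ steps would be replaced by the corresponding RDD transformations, with the grouping and summing done across partitions.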

Page 7: Spark Summit EU talk by Kaarthik Sivashanmugam

Develop & Launch Mobius Applications

Develop:
1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver

Launch (Spark Client):
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)

Example: sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g --exe Pi.exe D:\Mobius\examples\Pi

Page 8: Spark Summit EU talk by Kaarthik Sivashanmugam

Mobius & Spark

[Architecture diagram. Driver: a C# Driver (CLR) connected over IPC sockets to the SparkContext (JVM). Workers: three nodes, each with a C# Worker (CLR) connected over IPC sockets to a Spark Executor (JVM).]

Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux

Page 9: Spark Summit EU talk by Kaarthik Sivashanmugam

DEMO – RUNNING MOBIUS APP

Page 10: Spark Summit EU talk by Kaarthik Sivashanmugam

sparkclr-submit.cmd

> sparkclr-submit.cmd --exe SparkClrWordCount.exe C:\spark-clr_2.10-1.6.200\examples\Batch\WordCount C:\temp\wcdata.txt

(sparkclr-submit.sh in Linux)

Page 11: Spark Summit EU talk by Kaarthik Sivashanmugam

DEMO – DEBUGGING MOBIUS APP

Page 12: Spark Summit EU talk by Kaarthik Sivashanmugam

Debug Mode

> sparkclr-submit.cmd debug

Debug Mobius Word Count example in VS (set CSharpWorkerPath in config)

Page 13: Spark Summit EU talk by Kaarthik Sivashanmugam

DEMO – USING MOBIUS C# SHELL

Page 14: Spark Summit EU talk by Kaarthik Sivashanmugam

sparkclr-shell.cmd

> var rdd = sc.Parallelize(Enumerable.Range(0, 100), 2);
> rdd.Reduce((x, y) => x + y) // prints sum of the values
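As a local check on what the shell session should print, here is a sketch with no Spark involved. It splits 0..99 into two chunks to mimic the two partitions requested, reduces each chunk on its own, then combines the partial results. The chunking rule (contiguous, equally sized chunks) and the names `ReduceSketch` and `ReduceInChunks` are assumptions made for illustration, not Mobius or Spark behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReduceSketch
{
    // Hypothetical helper: reduce each chunk separately, then reduce the partial results.
    // Assumes a non-empty input and chunks >= 1.
    public static int ReduceInChunks(IEnumerable<int> values, int chunks, Func<int, int, int> op)
    {
        var data = values.ToList();
        int chunkSize = (data.Count + chunks - 1) / chunks; // ceiling division
        var partials = new List<int>();
        for (int start = 0; start < data.Count; start += chunkSize)
        {
            // Reduce one chunk, as a single partition would.
            partials.Add(data.Skip(start).Take(chunkSize).Aggregate(op));
        }
        // Combine the per-chunk results with the same operation.
        return partials.Aggregate(op);
    }

    public static void Main()
    {
        // 0 + 1 + ... + 99 = 99 * 100 / 2 = 4950
        Console.WriteLine(ReduceInChunks(Enumerable.Range(0, 100), 2, (x, y) => x + y));
    }
}
```

Splitting the work this way only gives the same answer as a single pass when the operation is associative, which addition is.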

Page 15: Spark Summit EU talk by Kaarthik Sivashanmugam

DEMO – USING F# SHELL

Page 16: Spark Summit EU talk by Kaarthik Sivashanmugam

fsi.exe

> sparkclr-submit.cmd debug

> fsi --use:C:\temp\mobius-init.fsx

let dataframe = session.Read().Json(@"C:\temp\data.json");;

dataframe.Show();;
dataframe.ShowSchema();;
dataframe.Count();;

Page 17: Spark Summit EU talk by Kaarthik Sivashanmugam

Kafka Streaming Example

Initialize StreamingContext & Checkpoint

Create Kafka DStream

Use DStream transformations to count logs by loglevel within a time window

Save log count

Start stream processing
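To make the "count logs by loglevel within a time window" step concrete, the sketch below does the same aggregation on an in-memory list with LINQ. It uses fixed, non-overlapping windows. The `LogEntry` type, the one-minute window, and the sample data are all assumptions, and the code does not use the Mobius StreamingContext or DStream API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the slide does not show the real log format.
public sealed class LogEntry
{
    public DateTime Timestamp { get; }
    public string Level { get; }

    public LogEntry(DateTime timestamp, string level)
    {
        Timestamp = timestamp;
        Level = level;
    }
}

public static class WindowedCountSketch
{
    // Assigns each entry to a fixed window by rounding its timestamp down,
    // then counts entries per (window start, log level).
    public static Dictionary<Tuple<DateTime, string>, int> CountByLevel(
        IEnumerable<LogEntry> entries, TimeSpan window)
    {
        return entries
            .GroupBy(e => Tuple.Create(
                new DateTime(e.Timestamp.Ticks - e.Timestamp.Ticks % window.Ticks),
                e.Level))
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var start = new DateTime(2016, 10, 26, 12, 0, 0); // arbitrary sample time
        var entries = new List<LogEntry>
        {
            new LogEntry(start.AddSeconds(5), "INFO"),
            new LogEntry(start.AddSeconds(20), "ERROR"),
            new LogEntry(start.AddSeconds(40), "INFO"),
            new LogEntry(start.AddSeconds(65), "INFO"),
        };

        foreach (var pair in CountByLevel(entries, TimeSpan.FromMinutes(1)))
        {
            Console.WriteLine($"{pair.Key.Item1:HH:mm:ss} {pair.Key.Item2}: {pair.Value}");
        }
    }
}
```

In a streaming job the same grouping would run once per batch interval on whatever records arrived in the window, rather than once over a complete list.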

Page 18: Spark Summit EU talk by Kaarthik Sivashanmugam

Mobius in Linux

• Mono is used to run Mobius with Spark in Linux

• Mobius project CI (build, unit & functional tests) in Ubuntu

• Mobius validated in Ubuntu, CentOS, OSX

• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR

• More info at linux-instructions.md @ GitHub

Page 19: Spark Summit EU talk by Kaarthik Sivashanmugam

Project Info
• https://github.com/Microsoft/Mobius (contributions welcome!)
• MIT license
• Discussions
– StackOverflow: tag “SparkCLR”
– Gitter: https://gitter.im/Microsoft/Mobius
– Twitter: @MobiusForSpark

Page 20: Spark Summit EU talk by Kaarthik Sivashanmugam

CSharpRDD
• C# operations use CSharpRDD, which needs the CLR to execute
– If there is no C# transformation or UDF, the CLR is not needed ~ execution is entirely JVM-based
• RDD<byte[]>
– Data is stored as serialized objects and sent to the C# worker process
• Transformations are pipelined when possible
– Avoids unnecessary serialization & deserialization within a stage
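To illustrate why pipelining helps, the sketch below compares two ways of applying a map followed by a filter to records held as `byte[]`. The unpipelined path gives each operation its own trip through bytes; the pipelined path composes both operations and pays for a single trip. The 4-byte integer encoding, the counters, and all names here are assumptions chosen for illustration, not Mobius internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipelineSketch
{
    // Counters so the cost of each approach is visible.
    public static int Deserializations;
    public static int Serializations;

    static int Deserialize(byte[] bytes)
    {
        Deserializations++;
        return BitConverter.ToInt32(bytes, 0);
    }

    static byte[] Serialize(int value)
    {
        Serializations++;
        return BitConverter.GetBytes(value);
    }

    // One round trip: bytes in, apply the operation on typed values, bytes out.
    static List<byte[]> RoundTrip(
        List<byte[]> input, Func<IEnumerable<int>, IEnumerable<int>> operation)
    {
        return operation(input.Select(b => Deserialize(b)))
            .Select(v => Serialize(v))
            .ToList();
    }

    // Map and filter each get their own round trip through bytes.
    public static List<byte[]> RunUnpipelined(List<byte[]> input)
    {
        var mapped = RoundTrip(input, xs => xs.Select(x => x * 2));
        return RoundTrip(mapped, xs => xs.Where(x => x % 3 == 0));
    }

    // Map and filter are composed, so there is a single round trip.
    public static List<byte[]> RunPipelined(List<byte[]> input)
    {
        return RoundTrip(input, xs => xs.Select(x => x * 2).Where(x => x % 3 == 0));
    }

    public static void Main()
    {
        var input = Enumerable.Range(0, 10).Select(i => BitConverter.GetBytes(i)).ToList();

        Deserializations = Serializations = 0;
        RunUnpipelined(input);
        Console.WriteLine($"unpipelined: {Deserializations} deserializations, {Serializations} serializations");

        Deserializations = Serializations = 0;
        RunPipelined(input);
        Console.WriteLine($"pipelined: {Deserializations} deserializations, {Serializations} serializations");
    }
}
```

Both paths produce the same records; the composed version simply never materializes the intermediate result as bytes.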

Page 21: Spark Summit EU talk by Kaarthik Sivashanmugam

Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of data ~ impacts performance
– C# operations are pipelined when possible ~ minimizes Ser/De
– Persistence is handled by the JVM ~ checkpoint/cache on an RDD impacts pipelining for CLR operations
• DataFrame operations without C# UDFs do not require Ser/De
– Perf will be the same as a native Scala-based Spark application
– Execution plan optimization & code generation perf improvements in Spark are leveraged

Page 22: Spark Summit EU talk by Kaarthik Sivashanmugam

INTERNALS OF DRIVER & WORKER

Page 23: Spark Summit EU talk by Kaarthik Sivashanmugam

Driver-side Interop

1. sparkclr-submit.cmd or sparkclr-submit.sh launches CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend

[Diagram: on the CLR side, SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, … create and manage proxies for JVM objects; the interop components invoke JVM methods; on the JVM side, SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD mirror the C#-side operations]
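One way to picture "proxies for JVM objects" is a handle-based call dispatcher: the backend keeps the real objects in a table keyed by ID, and the proxy holds only the ID and forwards method calls by name. The sketch below does this in a single process using reflection. It is not Mobius's protocol: there is no socket, no Netty server, and no JVM, and `BackendSketch`, `ObjectProxy`, and `Accumulator` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for an object that lives on the "other side" of the bridge.
public class Accumulator
{
    private int total;
    public void Add(int amount) { total += amount; }
    public int Total() { return total; }
}

// Stand-in for the backend: owns the real objects and hands out IDs.
public class BackendSketch
{
    private readonly Dictionary<string, object> objects = new Dictionary<string, object>();
    private int nextId;

    public string Register(object target)
    {
        string id = "obj-" + (nextId++);
        objects[id] = target;
        return id;
    }

    // Looks up the object by ID and invokes the named public method via reflection.
    // Assumes the method name is not overloaded on the target type.
    public object Call(string id, string method, params object[] args)
    {
        object target = objects[id];
        return target.GetType().GetMethod(method).Invoke(target, args);
    }
}

// Caller-side proxy: holds only the ID and forwards every call to the backend.
public class ObjectProxy
{
    private readonly BackendSketch backend;
    private readonly string id;

    public ObjectProxy(BackendSketch backend, string id)
    {
        this.backend = backend;
        this.id = id;
    }

    public object Invoke(string method, params object[] args)
    {
        return backend.Call(id, method, args);
    }
}

public static class ProxyDemo
{
    public static void Main()
    {
        var backend = new BackendSketch();
        var proxy = new ObjectProxy(backend, backend.Register(new Accumulator()));
        proxy.Invoke("Add", 2);
        proxy.Invoke("Add", 3);
        Console.WriteLine(proxy.Invoke("Total")); // state lives in the backend, not the proxy
    }
}
```

In a cross-process setup the `Call` step would be a serialized request sent over a socket, with the ID, method name, and arguments encoded on the wire.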

Page 24: Spark Summit EU talk by Kaarthik Sivashanmugam

Worker-side Interop

1. Compute: the Executor (JVM) in the Spark Worker computes CSharpRDD
2. Launch: CSharpWorker.exe (CLR) is launched
3. Read bytes
4. Execute C# operation
5. Write bytes
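The read bytes, execute, write bytes cycle can be sketched with two streams. The example below reads length-prefixed records from an input stream, applies a C# function to each, and writes length-prefixed results to an output stream. The framing (a 4-byte length followed by the payload, with a negative length marking the end) and every name here are assumptions made for illustration; the slide does not show the actual CSharpWorker protocol.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WorkerLoopSketch
{
    // Reads records until the end marker, applies the operation, writes the results.
    public static void Process(Stream input, Stream output, Func<byte[], byte[]> operation)
    {
        var reader = new BinaryReader(input);
        var writer = new BinaryWriter(output);
        while (true)
        {
            int length = reader.ReadInt32();          // read bytes: record length
            if (length < 0) break;                    // assumed end-of-data marker
            byte[] record = reader.ReadBytes(length); // read bytes: record payload
            byte[] result = operation(record);        // execute C# operation
            writer.Write(result.Length);              // write bytes: result length
            writer.Write(result);                     // write bytes: result payload
        }
        writer.Write(-1);                             // pass the end marker along
        writer.Flush();
    }

    // Helper: frames strings as length-prefixed UTF-8 records plus an end marker.
    public static MemoryStream Frame(IEnumerable<string> records)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        foreach (var record in records)
        {
            byte[] payload = Encoding.UTF8.GetBytes(record);
            writer.Write(payload.Length);
            writer.Write(payload);
        }
        writer.Write(-1);
        writer.Flush();
        stream.Position = 0;
        return stream;
    }

    // Helper: reads framed records back into strings.
    public static List<string> Unframe(Stream stream)
    {
        var reader = new BinaryReader(stream);
        var records = new List<string>();
        int length;
        while ((length = reader.ReadInt32()) >= 0)
        {
            records.Add(Encoding.UTF8.GetString(reader.ReadBytes(length)));
        }
        return records;
    }

    public static void Main()
    {
        var input = Frame(new[] { "spark", "mobius" });
        var output = new MemoryStream();
        Process(input, output, bytes =>
            Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes).ToUpperInvariant()));
        output.Position = 0;
        foreach (var record in Unframe(output)) Console.WriteLine(record);
    }
}
```

Replacing the two `MemoryStream` instances with the two directions of a socket stream gives the cross-process version of the same loop.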

Page 25: Spark Summit EU talk by Kaarthik Sivashanmugam

THANK YOU.
• Mobius is production-ready & Cloud-ready
• Use Mobius to build Apache Spark jobs in .NET
• Contribute to github.com/Microsoft/Mobius
• @MobiusForSpark