Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014

Intro to Apache Spark: Fast cluster computing engine for Hadoop

Intro to Scala: Object-oriented and functional language for the Java Virtual Machine

ACM SIGKDD, 7/9/2014

Roger Huang

Lead System Architect

rohuang@visa.com

@BigDataWrangler


About me: Roger Huang

• Visa

– Digital & Mobile Products Architecture, Strategic Projects & Infrastructure

– Search infrastructure

– Customer segmentation

– Logging Framework

– Splunk on Hadoop (Hunk)

– Real-time monitoring

– Data

• PayPal

– Java Infrastructure


Different perspectives on an elephant Scala


Outline

• Spark

– Hadoop ecosystem

• Scala

– Background

• Why Scala?

– For the computer scientist

– For the Java / OO programmer

– For the Spark developer

– For the Big Data developer

– For the Big Data scientist / mathematician

– For the system architect


Spark in the Hadoop ecosystem


Spark Ecosystem of Software Projects

• Spark [Ognen]

– APIs: Scala, Python [Robert], Java

• “SQL”

– Shark (Hive + Spark) [Roger]

– SparkSQL (alpha)

• Machine Learning Library (MLlib) [Omar]

– Clustering

– Classification

• Binary classification

• Linear regression

– Recommendations

• Spark Streaming [Chance]

• GraphX [Srini]

• …


Resilient Distributed Dataset

• Fault tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

• Data sources for RDDs

– Parallelized collections

• From Scala collections

– Hadoop datasets

• From HDFS or any Hadoop-supported storage system (HBase, Amazon S3, …)

• Text files, SequenceFile, any Hadoop InputFormat

• Two types of operations

– Transformation

• takes an existing dataset and creates a new one

– Action

• takes a dataset, runs a computation, and returns a value to the driver program


(Some) RDD Operations

• Transformations

– map(func)

– filter(func)

– flatMap(func)

– mapPartitions(func)

– mapPartitionsWithIndex(func)

– sample(withReplacement, fraction, seed)

– union(otherDataset)

– distinct()

– groupByKey()

– reduceByKey(func)

– sortByKey()

– join(otherDataset)

– cogroup(otherDataset)

– cartesian(otherDataset)

• Actions

– reduce(func)

– collect()

– count()

– first()

– take(n)

– takeSample(withReplacement, num, seed)

– saveAsTextFile(path)

– saveAsSequenceFile(path)

– countByKey()

– foreach(func)

– …
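The shape of a transformation/action pipeline can be sketched without a cluster. The block below is only an analogy: it uses plain Scala collections as a stand-in for an RDD (an assumption, since Spark is not on the classpath here), and the object and value names are made up for illustration. On Spark, a similar chain would use the operations listed above, such as flatMap, map, reduceByKey, count, and first.

```scala
// Local Scala collections standing in for an RDD (assumption: no Spark available).
// All names here are hypothetical and chosen for illustration.
object RddShapeSketch {
  val lines: List[String] = List("spark and scala", "scala and akka")

  // "Transformations": each call returns a new collection; `lines` is untouched.
  val words: List[String] = lines.flatMap(_.split(" ").toList)
  val pairs: List[(String, Int)] = words.map(w => (w, 1))

  // Stand-in for reduceByKey(_ + _): group by key, then sum the counts per key.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // "Actions": return a plain value to the caller (the driver, in Spark terms).
  val total: Int = words.size
  val firstWord: String = words.head
}
```

The point of the sketch is the split between steps that build new datasets and steps that hand a value back to the calling program.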


Scala background

• Scalable, object-oriented, functional language

– Version 2.11 (4/2014)

• Runs on the Java Virtual Machine

• Martin Odersky

– javac

– Java generics

• http://scala-lang.org/, REPL

• http://www.scala-lang.org/api/current

• http://scala-ide.org/

• http://www.scala-sbt.org/, Simple build tool

• Who’s using Scala?

– Twitter, LinkedIn, …

• Powered by Scala

– Apache Spark, Apache Kafka, Akka,…


Scala for the computer scientist: functional programming (FP)

• Math functions, e.g., f(x) = y

– A function has a single responsibility

– A function has no side effects

– A function is referentially transparent

• A function outputs the same value for the same inputs.

• Functional programming

– Expresses computation as the evaluation and composition of mathematical functions

– Avoids side effects and mutable state
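A small sketch may make the distinction concrete. The function names below (addTax, addToTotal) are hypothetical and not from the talk; the first is referentially transparent, the second is not because it reads and mutates state outside itself.

```scala
// Hypothetical names for illustration; not part of the original slides.
object PuritySketch {
  // Pure: the result depends only on the arguments, so any call can be
  // replaced by its value without changing the program.
  def addTax(amount: Int, ratePercent: Int): Int =
    amount + amount * ratePercent / 100

  // Impure: the result depends on, and changes, hidden mutable state.
  private var runningTotal = 0
  def addToTotal(amount: Int): Int = {
    runningTotal += amount
    runningTotal
  }
}
```

Calling addTax(100, 10) yields the same value every time, while two identical calls to addToTotal(10) return different results.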


Why functional programming?

• Multi core processors

• Concurrency

– Computation as a series of independent data transformations

– Parallel data transformations without side effects

• Referential transparency


Scala for the computer scientist: functional programming

• Functions

– Lambda, closure

• For-comprehensions

• Type inference

• Pattern matching

• Higher order functions

– map, flatMap, foldLeft

• And more …


FP: functions

• Anonymous function

– Function without a name

– lambda function

• Example

– scala> List(100, 200, 300) map { _ * 10/100}

– res0: List[Int] = List(10, 20, 30)

• Closure (Wikipedia)

– Closure = A function, together with a referencing environment – a table storing a reference to each of the non-local variables of that function.

– A closure allows a function to access those non-local variables even when invoked outside its immediate lexical scope.


FP: functions

• applyPercentage is an example of a closure

– scala> var percentage = 10

– percentage: Int = 10

– scala> val applyPercentage = (amount: Int) => amount * percentage / 100

– applyPercentage: Int => Int = <function1>

– scala> percentage = 20

– percentage: Int = 20

– scala> List (100, 200, 300) map applyPercentage

– res1: List[Int] = List(20, 40, 60)

– scala>


FP: functions

• Anonymous function

• Closure


FP: Higher order functions

scala> :load Person.scala

Loading Person.scala...

defined class Person

scala> val jd = new Person("John", "Doe", 17)

jd: Person = Person@372a6e85

scala> val rh = new Person("Roger", "Huang", 34)

rh: Person = Person@611c4041

scala> val people = Array(jd, rh)

people: Array[Person] = Array(Person@372a6e85, Person@611c4041)

scala> val (minors, adults) = people partition (_.age < 18)

minors: Array[Person] = Array(Person@372a6e85)

adults: Array[Person] = Array(Person@611c4041)

scala>
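The slides do not show Person.scala itself. A minimal definition consistent with the REPL session above might look like the following; the field names firstName and lastName are assumptions, and only age is actually referenced in the session. A plain class (not a case class) matches the Person@372a6e85 style of output, since a case class would print its fields.

```scala
// Hypothetical Person.scala; the talk does not show the file.
// Only `age` is confirmed by the session; the other field names are guesses.
class Person(val firstName: String, val lastName: String, val age: Int)

object PersonSketch {
  val people: Array[Person] =
    Array(new Person("John", "Doe", 17), new Person("Roger", "Huang", 34))

  // partition is a higher-order function: it takes the predicate `_.age < 18`
  // and returns a pair (elements that match, elements that do not).
  val (minors, adults) = people.partition(_.age < 18)
}
```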


FP: Higher order functions

• A higher-order function (HOF) does at least one of the following:

– Takes a function as an argument

– Returns a function
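Both cases can be sketched in a few lines. The names applyTwice and percentageOf are hypothetical, picked to echo the earlier percentage example.

```scala
// Hypothetical helper names for illustration.
object HofSketch {
  // Takes a function as an argument and applies it two times.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Returns a function: builds a percentage calculator for a fixed rate.
  def percentageOf(rate: Int): Int => Int = amount => amount * rate / 100

  // The returned function is an ordinary value and can be passed to map.
  val tenPercent: Int => Int = percentageOf(10)
  val discounted: List[Int] = List(100, 200, 300).map(tenPercent)
}
```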


FP: Higher order functions: map

• Creates a new collection from an existing collection by applying a function

• Anonymous function

scala> List(1, 2, 3 ) map { (x: Int) => x + 1 }

res0: List[Int] = List(2, 3, 4)

• Function literal

scala> List(1, 2, 3) map { _ + 1 }

res1: List[Int] = List(2, 3, 4)

• Passing an existing function

scala> def addOne(num: Int) = num + 1

addOne: (num: Int)Int

scala> List(1, 2, 3) map addOne

res2: List[Int] = List(2, 3, 4)


FP: Higher order functions: flatMap
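The slide content for flatMap did not survive in this transcript. As a rough illustration only (the sample data is made up), flatMap applies a function that returns a collection for each element and then concatenates the results, whereas map would leave them nested:

```scala
// Illustrative data; not from the original slide.
object FlatMapSketch {
  val phrases: List[String] = List("fast cluster", "computing engine")

  // map keeps one result per input element, so the lists stay nested.
  val nested: List[List[String]] = phrases.map(_.split(" ").toList)

  // flatMap maps and then flattens one level.
  val flat: List[String] = phrases.flatMap(_.split(" ").toList)
}
```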


FP: for-comprehension

• Syntax

– for ( <generator> | <guard> ... ) [yield] <expression>

• Types

– Imperative form. Does not return a value.

scala> val aList = List(1, 2, 3)

aList: List[Int] = List(1, 2, 3)

scala> val bList = List(4, 5, 6)

bList: List[Int] = List(4, 5, 6)

scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )

5

6

7


FP: for-comprehension

• Syntax

– for ( <generator> | <guard> ... ) [yield] <expression>

• Types

– Functional form (a.k.a. sequence comprehension). Returns/yields a value.

scala> for { a <- aList; b <- bList} yield a + b

res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)

scala> res0.take(1)

res1: List[Int] = List(5)

scala> for { a <- aList; if (a < 2); b <- bList } yield a + b

res2: List[Int] = List(5, 6, 7)

scala>
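A for-comprehension with yield is syntactic sugar over the higher-order functions covered earlier. The sketch below writes the guarded example from the session both ways; the object and value names are illustrative.

```scala
// Same lists as in the REPL session; the wrapper names are illustrative.
object ForDesugarSketch {
  val aList = List(1, 2, 3)
  val bList = List(4, 5, 6)

  // Functional form with a guard on the first generator.
  val viaFor: List[Int] =
    for { a <- aList; if a < 2; b <- bList } yield a + b

  // Roughly what the compiler rewrites it to: the guard becomes withFilter,
  // the outer generator becomes flatMap, and the last generator becomes map.
  val viaMethods: List[Int] =
    aList.withFilter(a => a < 2).flatMap(a => bList.map(b => a + b))
}
```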


FP: foldLeft

• scala> val numbers = 1.to(10)

• numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

• scala> def add( a:Int, b:Int ): Int = { a + b }

• add: (a: Int, b: Int)Int

• scala> numbers.foldLeft(0){ add }

• res0: Int = 55

• scala> numbers.foldLeft(0){ (acc, b) => acc + b }

• res1: Int = 55

• scala>
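To see how foldLeft threads the accumulator from left to right, scanLeft can be used as a tracing aid: it takes the same arguments but keeps every intermediate accumulator value. This is a small sketch on a shorter range than the session above.

```scala
// A shorter range than in the session, to keep the trace readable.
object FoldTraceSketch {
  val numbers = 1.to(4)

  // foldLeft returns only the final accumulator.
  val sum: Int = numbers.foldLeft(0)(_ + _)

  // scanLeft returns the start value followed by each intermediate accumulator:
  // 0, 0+1, 1+2, 3+3, 6+4.
  val steps: List[Int] = numbers.scanLeft(0)(_ + _).toList
}
```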


FP: find the last item in an array

• scala> val ns = Array(20, 40, 60)

• ns: Array[Int] = Array(20, 40, 60)

• scala> ns.foldLeft(ns.head) {(acc, b) => b}

• res0: Int = 60

• scala>


FP: reverse an array w/ foldLeft

• scala> val ns = Array(20, 40, 60)

• ns: Array[Int] = Array(20, 40, 60)

• scala> ns.foldLeft( Array[Int]() ) { (acc, b) => b +: acc}

• res1: Array[Int] = Array(60, 40, 20)

• scala>


Scala for the Java / OO developer

• Interoperable w/ Java

• Case classes

• Mixins with traits
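As a minimal sketch of mixing in a trait (all names below are hypothetical), a trait can carry both abstract and concrete members, and can be attached to a class definition or to a single instance at construction time:

```scala
// Hypothetical names for illustration.
object MixinSketch {
  trait Greeter {
    def name: String                          // abstract member
    def greet: String = s"Hello, $name"       // concrete member
  }

  // A trait that refines the behavior it is stacked on top of.
  trait Shouter extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }

  class Speaker(val name: String) extends Greeter

  val plain = new Speaker("Scala")
  val loud = new Speaker("Scala") with Shouter   // mixed in per instance
}
```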


Scala for the Java / OO developer

• case class

– Implements equals(), hashCode(), toString()

– Can be used in Pattern Matching
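A short sketch of both points follows. The Person case class and the describe function are illustrative and are not the Person class loaded in the earlier REPL session.

```scala
// Illustrative only; distinct from the plain Person class used earlier.
object CaseClassSketch {
  case class Person(firstName: String, lastName: String, age: Int)

  // Pattern matching destructures the case class and can add guards.
  def describe(p: Person): String = p match {
    case Person(first, _, age) if age < 18 => s"$first is a minor"
    case Person(first, _, _)               => s"$first is an adult"
  }

  val jd = Person("John", "Doe", 17)   // no `new` needed: apply() is generated
}
```

Two instances with the same field values compare equal and share a hash code, and toString prints the fields instead of an object identity.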


Scala for the Java / OO developer

• http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html

• map

– <R> Stream<R> map(Function<? super T,? extends R> mapper)

– Returns a stream consisting of the results of applying the given function to the elements of this stream. This is an intermediate operation.

• flatMap

– <R> Stream<R> flatMap(Function<? super T,? extends Stream<? extends R>> mapper)

– Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element. Each mapped stream is closed after its contents have been placed into this stream. (If a mapped stream is null an empty stream is used, instead.) This is an intermediate operation.


Scala for the Spark developer

• Resilient Distributed Dataset (RDD)

• A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist.

• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD


Scala for the Big Data developer

• Spark

– Programming API in Scala

– Implemented in Scala

• Scalding

– Scala DSL on top of Cascading

– Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows

– Abstractions: tuples, pipes, source/sink taps

• Algebird

• Summingbird

– Library that lets you write MapReduce programs that look like native Scala or Java collection transformations

– Execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.


Scala for the Big Data scientist / mathematician

• Monoid

– If you want to “attach” operations such as +, -, *, / or <= to data objects (e.g., Bloom filters), then you want to provide monoid forms of those data objects

– Consists of

• A set of objects

• A binary operation on that set that satisfies the monoid axioms: it is associative and has an identity element

• Monad

– If you want to create a data processing pipeline that transforms the state of a data object

– composition
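Both ideas can be sketched with the standard library alone. The Monoid trait below is a hand-rolled illustration, not the API of Algebird or any other library, and every name in it is an assumption. Set union stands in for a mergeable data structure such as the Bloom filter mentioned above; Option is used as a familiar monad whose flatMap composes pipeline steps.

```scala
import scala.util.Try

// Hand-rolled illustration; not the API of Algebird or any other library.
object MonoidSketch {
  trait Monoid[A] {
    def zero: A                    // identity element
    def combine(x: A, y: A): A     // associative binary operation
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Set union: a simple stand-in for a mergeable structure like a Bloom filter.
  val setUnion: Monoid[Set[String]] = new Monoid[Set[String]] {
    def zero: Set[String] = Set.empty
    def combine(x: Set[String], y: Set[String]): Set[String] = x union y
  }

  // Any monoid can be folded over a list in the same way.
  def combineAll[A](xs: List[A], m: Monoid[A]): A = xs.foldLeft(m.zero)(m.combine)

  // Monad-style composition: each step may fail, flatMap chains the steps.
  def parseAge(s: String): Option[Int] = Try(s.toInt).toOption
  def adultAge(age: Int): Option[Int] = if (age >= 18) Some(age) else None
  def pipeline(s: String): Option[Int] = parseAge(s).flatMap(adultAge)
}
```

Because the operation is associative, partial results computed on different partitions can be combined in any grouping, which is what makes monoids a good fit for distributed aggregation.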


Scala for the system architect

• Concurrency

• Problem:

– Threads

– Shared mutable state

– Locks

• Solution:

– message passing concurrency w/ Actors

– Future, Promise

• Abstractions

– Actor

• an object that processes a message

• encapsulates state (state not shared)

– ActorRef

– Message, usually sent asynchronously

– Mailbox

– ActorSystem
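Actors need the Akka library, so the sketch below covers only the Future and Promise part of the solution, using scala.concurrent from the standard library. The function names are hypothetical, and Await is used here only to read the results in a demo; production code would normally stay asynchronous.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical names; a demo of composing asynchronous results without locks.
object FutureSketch {
  // A Future runs its body on the execution context and holds the eventual result.
  def halve(x: Int): Future[Int] = Future { x / 2 }

  // Futures compose with for-comprehensions; no shared mutable state is needed.
  def total(): Future[Int] =
    for { a <- halve(10); b <- halve(20) } yield a + b

  // A Promise is the write side: completing it completes the associated Future.
  def viaPromise(): Future[String] = {
    val promise = Promise[String]()
    val upper = promise.future.map(_.toUpperCase)
    promise.success("done")
    upper
  }

  // Blocking helper for the demo only.
  def resultOf[A](f: Future[A]): A = Await.result(f, 5.seconds)
}
```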


Scala for the system architect: Akka

• Fault tolerance

– Supervision

– Strategies

• Resume, restart, stop, escalate, …

• Scale out: remote actors

– Via configuration


Scala for the system architect

• Parallel collections

– scala> import scala.collection.parallel.immutable._

– import scala.collection.parallel.immutable._

– scala> ParVector(10, 20, 30, 40, 50, 60, 70, 80, 90) .map { x =>

– | println( Thread.currentThread.getName); x / 2 }

– ForkJoinPool-1-worker-13

– ForkJoinPool-1-worker-1

– ForkJoinPool-1-worker-1

– ForkJoinPool-1-worker-9

– ForkJoinPool-1-worker-11

– ForkJoinPool-1-worker-5

– ForkJoinPool-1-worker-3

– ForkJoinPool-1-worker-15

– ForkJoinPool-1-worker-7

– res0: scala.collection.parallel.immutable.ParVector[Int] = ParVector(5, 10, 15,

– 20, 25, 30, 35, 40, 45)

– scala>


Sequential collections


Parallel collections


References

• http://scala-lang.org/

• Scala in Action, Nilanjan Raychaudhuri

• Grokking Functional Programming, Aslam Khan

• Michael Noll
