Building Machine Learning Algorithms on Apache Spark

William Benton (@willb), Red Hat, Inc.
Session hashtag: #EUds5

Motivation

[Slides 2–5: image-only slides; no text survives.]

Forecast

Introducing our case study: self-organizing maps

Parallel implementations for partitioned collections (in particular, RDDs)

Beyond the RDD: data frames and ML pipelines

Practical considerations and key takeaways

Introducing self-organizing maps

A self-organizing map (SOM) learns a low-dimensional grid of units, each with a weight vector, so that the trained grid approximates the distribution of the training data while nearby units on the grid stay similar.

[Image slides follow; no text survives.]

Training self-organizing maps

The classic training procedure is online: examples are presented one at a time, and after each one, the best-matching unit and its neighbors move toward the example.

    # som holds one weight vector per unit; t counts updates so far
    while t < maxupdates:
        random.shuffle(examples)
        for ex in examples:
            t = t + 1
            if t == maxupdates:
                break
            bestMatch = closest(som, ex)
            for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                som[unit] = som[unit] + (ex - som[unit]) * alpha(t) * wt

We process the training set in random order. The neighborhood size sigma(t) controls how much of the map around the best-matching unit (BMU) is affected, and the learning rate alpha(t) controls how much closer to the example each affected unit gets.
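
To make the update rule concrete, here is a minimal, self-contained Scala sketch of a single online update. The one-dimensional map, the Gaussian neighborhood, and the fixed sigma and alpha in the demo are all illustrative assumptions, not the talk's implementation:

    import scala.math.exp

    object SomStep {
      type Vec = Array[Double]

      // squared Euclidean distance between two vectors
      def dist2(a: Vec, b: Vec): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // index of the best-matching unit (BMU) for an example
      def closest(som: Array[Vec], ex: Vec): Int =
        som.indices.minBy(u => dist2(som(u), ex))

      // Gaussian neighborhood weight over distance on the unit grid
      def neighborhood(bmu: Int, sigma: Double, units: Int): Seq[(Int, Double)] =
        (0 until units).map(u => (u, exp(-((u - bmu) * (u - bmu)) / (2 * sigma * sigma))))

      // one online update: move every unit toward ex, weighted by the neighborhood
      def update(som: Array[Vec], ex: Vec, sigma: Double, alpha: Double): Unit = {
        val bmu = closest(som, ex)
        for ((unit, wt) <- neighborhood(bmu, sigma, som.length); d <- som(unit).indices)
          som(unit)(d) += (ex(d) - som(unit)(d)) * alpha * wt
      }

      def main(args: Array[String]): Unit = {
        val som: Array[Vec] = Array(Array(0.0, 0.0), Array(1.0, 1.0), Array(2.0, 2.0))
        update(som, Array(1.5, 0.5), sigma = 1.0, alpha = 0.1)
        println(som.map(_.mkString("(", ", ", ")")).mkString(" "))
      }
    }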

Parallel implementations for partitioned collections

Historical aside: Amdahl's Law

If a fraction p of a program can be parallelized, then the overall speedup S from running the parallel part on s processors is bounded in the limit:

    lim (s → ∞) S(s) = 1 / (1 − p)
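
A quick worked example: with p = 0.95, the best possible speedup is 1 / (1 − 0.95) = 20×, no matter how many processors you add; the serial 5% dominates. This is why the serial portions of a learning algorithm, and not just its parallel efficiency, determine how well it scales.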

What forces serial execution?

A dependence on the previous step's result: each update consumes the state that the previous update produced.

    state[t+1] = combine(state[t], x)

In terms of function signatures, two shapes of combining function appear:

    f1: (T, T) => T
    f2: (T, U) => T

f2 folds one new example (of type U) into a state of type T, which suggests a serial left-to-right pass; f1 merges two states, which is what lets independently computed partial results be combined.
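
These are exactly the shapes Spark's RDD.aggregate expects: a seqOp shaped like f2 to fold examples into per-partition state, and a combOp shaped like f1 to merge partition states. Here is a minimal plain-Scala sketch of that structure (my example, computing a mean rather than anything from the talk):

    // per-partition state: a running (count, sum)
    case class MeanState(n: Long, sum: Double)

    object MeanAggregate {
      // f2-shaped: fold one example into the state
      def seqOp(s: MeanState, x: Double): MeanState = MeanState(s.n + 1, s.sum + x)

      // f1-shaped: merge two independently computed states
      def combOp(a: MeanState, b: MeanState): MeanState = MeanState(a.n + b.n, a.sum + b.sum)

      def main(args: Array[String]): Unit = {
        val partitions = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0))
        // each partition folds locally with seqOp, then the results merge with combOp
        val merged = partitions
          .map(_.foldLeft(MeanState(0, 0.0))(seqOp))
          .reduce(combOp)
        println(merged.sum / merged.n)  // 3.0
      }
    }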

How can we fix these?

Require the combining operation ⊕ to be commutative and associative:

    a ⊕ b = b ⊕ a
    (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

Under those laws, partial results can be computed in any order and merged anywhere, so no serial pass is forced. When an algorithm doesn't decompose this way, you can sometimes substitute one that does; the slides point to the choice between SGD and L-BFGS as optimizers. There will be examples of each of these approaches for many problems in the literature and in open-source code!
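
A cheap sanity check before trusting a ⊕ in parallel code is to spot-test the two laws on random inputs. This sketch (my addition, not from the talk) shows Long addition passing both laws and subtraction failing them:

    import scala.util.Random

    object MonoidLaws {
      def commutes[T](op: (T, T) => T)(a: T, b: T): Boolean =
        op(a, b) == op(b, a)

      def associates[T](op: (T, T) => T)(a: T, b: T, c: T): Boolean =
        op(op(a, b), c) == op(a, op(b, c))

      def main(args: Array[String]): Unit = {
        val rng = new Random(42)
        val good: (Long, Long) => Long = _ + _   // addition: passes both laws
        val bad: (Long, Long) => Long = _ - _    // subtraction: fails both
        val samples = Seq.fill(100)((rng.nextLong(), rng.nextLong(), rng.nextLong()))
        println(samples.forall { case (a, b, c) =>
          commutes(good)(a, b) && associates(good)(a, b, c) })  // true
        println(samples.forall { case (a, b, c) =>
          commutes(bad)(a, b) && associates(bad)(a, b, c) })    // false
      }
    }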

Implementing atop RDDs

We'll start with a batch implementation of our technique:

    for t in (1 to iterations):
        state = newState()
        for ex in examples:
            bestMatch = closest(som[t-1], ex)
            hood = neighborhood(bestMatch, sigma(t))
            state.matches += ex * hood
            state.hoods += hood
        som[t] = newSOM(state.matches / state.hoods)

Each batch produces a model that can be averaged with other models, so each partition of the data can compute its state independently. But this won't always work: not every technique yields models that can be meaningfully averaged.
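
A toy illustration of that caveat (my example, not the talk's): two k-means runs can learn identical centers in different orders, and parameter-wise averaging then destroys both:

    object AveragingPitfall {
      def avg(a: Array[Double], b: Array[Double]): Array[Double] =
        a.zip(b).map { case (x, y) => (x + y) / 2 }

      def main(args: Array[String]): Unit = {
        // two runs learned the same centers, 0.0 and 10.0, numbered differently
        val run1 = Array(0.0, 10.0)
        val run2 = Array(10.0, 0.0)
        println(avg(run1, run2).mkString(", "))  // 5.0, 5.0 -- neither center survives
      }
    }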

An implementation template

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      val newState = examples.aggregate(ModelState.empty())(
        // "fold": update the state for this partition with a single new example
        { case (state: ModelState, example: Example) =>
            state.update(nextModel.lookup(example, i), example) },
        // "reduce": combine the states from two partitions
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
    }

One problem: referring to nextModel inside the closures will cause the model object to be serialized with the closure and shipped with every task. Broadcasting the working model avoids that:

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      // broadcast the current working model for this iteration
      val current = sc.broadcast(nextModel)
      val newState = examples.aggregate(ModelState.empty())(
        // get the value of the broadcast variable on the worker
        { case (state: ModelState, example: Example) =>
            state.update(current.value.lookup(example, i), example) },
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
      // remove the stale broadcasted model
      current.unpersist()
    }

Even so, this still uses the wrong implementation of the right interface: as the next slides show, aggregate combines every partition's result serially at the driver.

Implementing on RDDs

[Diagram, slides 37–38: with aggregate, every partition's partial state is sent to the driver, which applies ⊕ serially to combine all of them itself.]

[Diagram, slides 39–45: with treeAggregate, partial states are first combined pairwise with ⊕ on the workers, in a tree; the driver only combines the few results of the final level.]
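
Switching is nearly a one-word change, since treeAggregate exposes the same seqOp/combOp interface as aggregate, plus an optional depth for the combining tree. A sketch, assuming an existing SparkContext named sc:

    val examples = sc.parallelize(1 to 1000000, numSlices = 64).map(_.toDouble)

    // aggregate: all 64 partial sums are merged serially at the driver
    val total1 = examples.aggregate(0.0)(_ + _, _ + _)

    // treeAggregate: partial sums are merged in a tree on the workers first
    val total2 = examples.treeAggregate(0.0)(_ + _, _ + _, depth = 2)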

Beyond the RDD: data frames and ML pipelines

RDDs: some good parts

If you misuse a typed RDD, the mistake surfaces at compile time; the equivalent data-frame code compiles and fails at runtime instead:

    val rdd: RDD[String] = /* ... */
    rdd.map(_ * 3.0).collect()   // doesn't compile

    val df: DataFrame = /* data frame with one String-valued column */
    df.select($"_1" * 3.0).show()   // compiles, but crashes at runtime

The same model-application logic, first as a map over an RDD and then as a user-defined function applied to a data-frame column:

    rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) }

    val predict = udf((vec: SV) => model.value.closestWithSimilarity(vec))

    df.withColumn("predictions", predict($"features"))

RDDs versus query planning

Because RDD transformations are opaque to Spark, no query optimizer will save you from an inefficient plan like this one, which filters only after a huge cartesian product:

    val numbers1 = sc.parallelize(1 to 100000000)
    val numbers2 = sc.parallelize(1 to 1000000000)
    numbers1.cartesian(numbers2)
      .map { case (x, y) => (x, y, expensive(x, y)) }
      .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

You have to reorder the operations yourself, filtering before the product:

    numbers1.filter(isPrime(_))
      .cartesian(numbers2.filter(isPrime(_)))
      .map { case (x, y) => (x, y, expensive(x, y)) }
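
For contrast, the data-frame version of this job can be written naively, because the Catalyst optimizer is free to push each deterministic single-column filter below the join on its own. A sketch (my illustration, assuming a SparkSession named spark; isPrime and expensive are trivial stand-ins for the slide's functions):

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // trivial stand-ins for the slide's functions (assumptions)
    def isPrime(n: Long): Boolean =
      n > 1 && (2L to math.sqrt(n.toDouble).toLong).forall(n % _ != 0)
    def expensive(x: Long, y: Long): Double = math.hypot(x.toDouble, y.toDouble)

    val isPrimeUdf = udf((n: Long) => isPrime(n))
    val expensiveUdf = udf((x: Long, y: Long) => expensive(x, y))

    val numbers1 = spark.range(100000000L).toDF("x")
    val numbers2 = spark.range(1000000000L).toDF("y")

    // written naively; Catalyst can push both filters below the join
    numbers1.crossJoin(numbers2)
      .filter(isPrimeUdf($"x") && isPrimeUdf($"y"))
      .withColumn("result", expensiveUdf($"x", $"y"))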

RDDs and the Java heap

RDDs store ordinary JVM objects, and JVM objects carry overhead. Consider this small matrix:

    val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

On the heap it is three objects: an outer array whose header (class pointer, flags, size, lock word) precedes two element pointers, and two inner arrays whose headers precede the double values themselves. The four doubles are 32 bytes of data… and the headers and pointers are 64 bytes of overhead!

ML pipelines: a quick example

    from pyspark.ml.clustering import KMeans

    K, SEED = 100, 0xdea110c8

    randomDF = make_random_df()

    kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
    model = kmeans.fit(randomDF)
    withPredictions = model.transform(randomDF).select("x", "y", "prediction")

Working with ML pipelines

The pipeline API has two central operations: an estimator learns a model from a data frame, and the resulting model is a transformer that maps data frames to data frames:

    estimator.fit(df)
    model.transform(df)

Both the estimator and the model it produces are configured by params, such as inputCol, epochs, seed, and outputCol.

Defining parameters

    private[som] trait SOMParams extends Params with DefaultParamsWritable {
      final val x: IntParam = new IntParam(this, "x",
        "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))

      final def getX: Int = $(x)
      final def setX(value: Int): this.type = set(x, value)
      // ...
    }
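
Given such a trait, a hypothetical SOM estimator that mixes it in gets validated, chainable configuration for free. A usage sketch (the SOM class here is assumed, not shown in the talk):

    // `SOM` is a hypothetical estimator mixing in SOMParams
    val som = new SOM().setX(20)   // chainable, since setX returns this.type
    som.getX                       // 20; setX(0) would fail ParamValidators.gtEq(1)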

Don't repeat yourself

Spark ML ships reusable traits for common params, so you only declare what is unique to your technique:

    /**
     * Common params for KMeans and KMeansModel
     */
    private[clustering] trait KMeansParams extends Params with HasMaxIter
      with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol {
      /* ... */
    }

Estimators and transformers

    estimator.fit(df)
    model.transform(df)

Validate and transform at once

    def transformSchema(schema: StructType): StructType = {
      // check that the input columns exist...
      require(schema.fieldNames.contains($(featuresCol)))
      // ...and are the proper type
      schema($(featuresCol)) match {
        case sf: StructField => require(sf.dataType.equals(VectorType))
      }
      // ...and that the output columns don't exist
      require(!schema.fieldNames.contains($(predictionCol)))
      require(!schema.fieldNames.contains($(similarityCol)))
      // ...and then make a new schema
      schema.add($(predictionCol), "int")
            .add($(similarityCol), "double")
    }

Training on data frames

    def fit(examples: DataFrame) = {
      import examples.sparkSession.implicits._
      import org.apache.spark.ml.linalg.{Vector => SV}
      val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv }
      /* construct a model object with the result of training */
      new SOMModel(train(dfexamples, $(x), $(y)))
    }

Practical considerations and key takeaways

Retire your visibility hacks

Library developers have sometimes placed code inside Spark's own package namespace to reach useful but private classes, like VectorUDT:

    package org.apache.spark.ml.hacks

    object Hacks {
      import org.apache.spark.ml.linalg.VectorUDT
      val vectorUDT = new VectorUDT
    }

Newer releases make hacks like this unnecessary by exposing the same functionality as a public developer API:

    package org.apache.spark.ml.linalg

    /* imports, etc., are elided ... */
    @Since("2.0.0")
    @DeveloperApi
    object SQLDataTypes {
      val VectorType: DataType = new VectorUDT
      val MatrixType: DataType = new MatrixUDT
    }

Caching training data

Iterative training makes many passes over the examples, so cache them if the caller hasn't, and restore the original state afterwards; checking the storage level first ensures you never unpersist data your caller had deliberately cached:

    val wasUncached = examples.storageLevel == StorageLevel.NONE

    if (wasUncached) { examples.cache() }

    /* actually train here */

    if (wasUncached) { examples.unpersist() }

Improve serial execution times

Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms.

Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication.

Seek to limit the number of library invocations you make, and thus the time you spend copying data to and from your linear algebra library.
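
To illustrate the second suggestion, here is a sketch using Breeze (an assumed choice of library, not named in the talk) that replaces a loop of per-unit dot products with one matrix-vector multiplication:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // 100 SOM units, each a 3-dimensional weight vector, as the rows of a matrix
    val units = DenseMatrix.rand[Double](100, 3)
    val example = DenseVector.rand[Double](3)

    // one dot product per unit, in a loop
    val slow = (0 until units.rows).map(i => units(i, ::).t dot example)

    // the same 100 dot products in a single library call
    val fast = units * example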

Key takeaways

There are several techniques you can use to develop parallel implementations of machine learning algorithms.

The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you're developing libraries for Spark.

As a library developer, you might need to rely on developer APIs and dive in to Spark's source code, but things are getting easier with each release!


@willb • [email protected]
https://chapeau.freevariable.com
https://radanalytics.io

Thanks!