Numerical Computing with Spark

Hossein Falaki

Transcript of slides from the Stanford Spark workshop: rezab/sparkworkshop/slides/hossein.pdf

Page 1:

Numerical Computing with Spark

Hossein Falaki

Page 2:

Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:

1. Correctness

2. Performance

3. Trade-off between accuracy and performance


Page 3:

Three Practical Examples

• Point estimation (Variance)

• Approximate estimation (Cardinality)

• Matrix operations (PageRank)


We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

Page 4:

1. Big Data Variance

The plain variance formula requires two passes over the data:


$$\mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

First pass: compute the mean µ. Second pass: sum the squared deviations (x_i − µ)².
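For reference, a minimal single-machine sketch of the two passes in plain Scala (not from the slides):

    // First pass computes the mean; second pass sums squared deviations
    def twoPassVariance(xs: Seq[Double]): Double = {
      val n = xs.length
      val mu = xs.sum / n                        // first pass
      xs.map(x => (x - mu) * (x - mu)).sum / n   // second pass
    }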

Page 5:

Fast but inaccurate solution

$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2$$

Can be computed in a single pass, but it subtracts two very large, nearly equal numbers, so the result is dominated by floating-point cancellation!
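A quick illustration of the cancellation (the inputs are made up to provoke it; the exact error is platform-dependent):

    val xs = Seq(1e9, 1e9 + 1, 1e9 + 2)          // true variance is 2/3
    val n = xs.length
    val meanOfSquares = xs.map(x => x * x).sum / n
    val squareOfMean = math.pow(xs.sum / n, 2)
    val naive = meanOfSquares - squareOfMean     // dominated by rounding error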


Page 6:

Accumulator Pattern


An object that incrementally tracks the variance:

    class RunningVar {
      var variance: Double = 0.0

      // Compute initial variance for numbers
      def this(numbers: Iterator[Double]) = {
        this()  // auxiliary constructors must call the primary constructor first
        numbers.foreach(this.add(_))
      }

      // Update variance for a single value
      def add(value: Double) { ... }
    }
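The slide elides the body of add(). One standard, numerically stable choice (an assumption here, not necessarily the speaker's implementation) is Welford's online update; the sketch below replaces the single variance field with a count, a running mean, and the sum of squared deviations m2:

    class RunningVar {
      var count: Long = 0L
      var mean: Double = 0.0
      var m2: Double = 0.0  // sum of squared deviations from the running mean

      // Compute initial variance for numbers
      def this(numbers: Iterator[Double]) = {
        this()
        numbers.foreach(add)
      }

      // Welford's update: single pass, avoids catastrophic cancellation
      def add(value: Double): Unit = {
        count += 1
        val delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
      }

      def variance: Double = if (count > 0) m2 / count else 0.0
    }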

Page 7:

Parallelize for performance


• Distribute adding values in map phase

• Merge partial results in reduce phase

    class RunningVar {
      ...

      // Merge another RunningVar object and update variance
      def merge(other: RunningVar) = { ... }
    }
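Continuing the Welford-style sketch above, merge() can combine two partial results with the standard parallel-variance combination (Chan et al.); again this fills an elided body and is an assumption:

    // Combine another partial RunningVar into this one
    def merge(other: RunningVar): RunningVar = {
      val total = count + other.count
      if (total > 0) {
        val delta = other.mean - mean
        mean = (count * mean + other.count * other.mean) / total
        m2 += other.m2 + delta * delta * count * other.count / total
      }
      count = total
      this
    }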

Page 8:

Computing Variance in Spark


• Use the RunningVar in Spark:

    doubleRDD
      .mapPartitions(v => Iterator(new RunningVar(v)))
      .reduce((a, b) => a.merge(b))

• Or simply use the Spark API:

    doubleRDD.variance()
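Here mapPartitions builds one RunningVar per partition (the map phase) and reduce merges the partial accumulators pairwise (the reduce phase). The built-in variance() on an RDD of doubles comes from Spark's DoubleRDDFunctions and computes the population variance in one distributed pass.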

Page 9:

2. Approximate Estimations

• Often an approximate estimate is good enough, especially if it can be computed faster or cheaper:

1. Trade accuracy for memory

2. Trade accuracy for running time

• We especially like cases where there is a bound on the error that can be controlled


Page 10:

Cardinality Problem


Example: count the number of unique words in Shakespeare's works.

• Using a HashSet requires ~10GB of memory

• This can be much worse in many real-world applications involving large strings, such as counting web visitors

Page 11:

Linear Probabilistic Counting

1. Allocate a bitmap of size m, initialized to zero. For each value:

A. Hash the value to a position in the bitmap

B. Set the corresponding bit to 1

2. Count the number of still-empty bit entries: v


$$\text{count} \approx -m \ln \frac{v}{m}$$
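A minimal sketch of such a counter in Scala; the class name LPCounter matches the next slide's code, but everything in the body (hash choice, default bitmap size) is an assumption:

    import scala.util.hashing.MurmurHash3

    // Linear probabilistic counter over a boolean bitmap of size m
    class LPCounter(values: Iterator[String], m: Int = 1 << 20) {
      private val bitmap = new Array[Boolean](m)
      values.foreach(add)

      // Hash each value to a position and set the corresponding bit
      def add(value: String): Unit = {
        val h = MurmurHash3.stringHash(value)
        bitmap(((h % m) + m) % m) = true
      }

      // The union of two bitmaps is the counter for the union of the inputs
      def merge(other: LPCounter): LPCounter = {
        for (i <- 0 until m) if (other.bitmap(i)) bitmap(i) = true
        this
      }

      // v = number of still-empty entries; count ≈ −m ln(v/m)
      def getCardinality: Long = {
        val v = bitmap.count(b => !b)
        math.round(-m * math.log(v.toDouble / m))
      }
    }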

Page 12:

The Spark API


• Use the LPCounter in Spark:

    rdd
      .mapPartitions(v => Iterator(new LPCounter(v)))
      .reduce((a, b) => a.merge(b))
      .getCardinality

• Or simply use the Spark API:

    myRDD.countApproxDistinct(0.01)
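The argument to countApproxDistinct is the target relative standard deviation of the estimate (0.01 means roughly 1% relative error). Under the hood Spark uses a HyperLogLog-based sketch rather than a linear probabilistic counter, but the accuracy-for-memory trade-off is the same idea.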

Page 13:

3. Google PageRank


Popular algorithm originally introduced by Google

Page 14:

PageRank Algorithm

• Start each page with a rank of 1

• On each iteration:

A. Each page sends a contribution to each of its neighbors:

$$\text{contrib} = \frac{\text{curRank}}{|\text{neighbors}|}$$

B. Each page updates its rank from the contributions it receives:

$$\text{curRank} = 0.15 + 0.85 \sum_{i \in \text{neighbors}} \text{contrib}_i$$
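Before the distributed version, a small single-machine sketch of one update step over an in-memory adjacency map (plain Scala; names and structure are illustrative, not from the slides):

    // One PageRank iteration: scatter contributions, then apply the damping update
    def step(links: Map[String, List[String]],
             ranks: Map[String, Double]): Map[String, Double] = {
      val contribs = for {
        (url, neighbors) <- links.toSeq
        dest             <- neighbors
      } yield dest -> ranks(url) / neighbors.size
      val summed = contribs.groupBy(_._1).map { case (u, cs) => u -> cs.map(_._2).sum }
      ranks.map { case (url, _) => url -> (0.15 + 0.85 * summed.getOrElse(url, 0.0)) }
    }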

Page 15:

PageRank Example

[Diagram: a four-page link graph, every page starting with rank 1.0]

Page 16:

PageRank Example

[Diagram: each page splits its rank evenly as contributions of 1.0, 0.5, 0.5, 0.5, and 1.0 along its outgoing links]

Page 17:

PageRank Example

[Diagram: ranks after iteration 1: 1.85, 1.0, 0.58, 0.58]

Page 18:

PageRank Example

[Diagram: the ranks 1.85, 1.0, 0.58, 0.58 are split into contributions of 1.85, 0.58, 0.5, 0.5, and 0.29 along the links]

Page 19:

PageRank Example

[Diagram: ranks after iteration 2: 1.31, 1.72, 0.58, 0.39]

Page 20:

PageRank Example

[Diagram: eventually the ranks converge to 1.44, 1.37, 0.73, 0.46]

Page 21:

PageRank as Matrix Multiplication

• The rank of each page is the probability that a random surfer on the web lands on that page

• The probability distribution over pages after k steps is

$$V_k = A^k \times V$$

V: the initial rank vector
A: the link structure (a sparse matrix)
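To make the linear-algebra view concrete, here is a toy dense power-iteration sketch in plain Scala (illustrative only: the real A is sparse and, as the next slides show, Spark never materializes it as a matrix):

    // Computes A^k * v by k repeated matrix-vector products
    def powerIteration(a: Array[Array[Double]], v0: Array[Double], k: Int): Array[Double] = {
      var v = v0
      for (_ <- 1 to k) {
        v = a.map(row => row.zip(v).map { case (aij, xj) => aij * xj }.sum)
      }
      v
    }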

Page 22:

Data Representation in Spark


• Each page is identified by its unique URL rather than an index

• Rank vector (V): RDD[(URL, Double)]

• Links matrix (A): RDD[(URL, List[URL])]
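A hedged loading sketch for these two RDDs, assuming a SparkContext sc and an input file where each line lists a URL followed by its neighbors (the file name and format are assumptions; URLs are plain Strings):

    import org.apache.spark.rdd.RDD

    // Each input line: "url neighbor1 neighbor2 ..."
    val links: RDD[(String, List[String])] = sc.textFile("links.txt").map { line =>
      val parts = line.split("\\s+").toList
      (parts.head, parts.tail)
    }
    // Start every page with rank 1.0
    var ranks: RDD[(String, Double)] = links.mapValues(_ => 1.0)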

Page 23:

Spark Implementation


    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    for (i <- 1 to ITERATIONS) {
      val contribs = links.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

Page 24:

Matrix Multiplication

• Repeatedly multiply the sparse matrix and the vector

[Diagram: the Links (url, neighbors) file and the Ranks (url, rank) RDD flow through iterations 1, 2, 3; the same file is read over and over]

Page 25:

Spark can do much better


• Using cache(), keep neighbors in memory

• Do not write intermediate results on disk

[Diagram: Links (url, neighbors) cached in memory, joined with Ranks (url, rank) in every iteration; the same RDD is grouped over and over]

Page 26:

Spark can do much better


• Do not partition neighbors every time

[Diagram: Links (url, neighbors) partitioned once with partitionBy; each join then finds the matching Ranks (url, rank) partition on the same node]

Page 27:

Spark Implementation


    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    // partitionBy returns a new RDD, so keep the result and cache it
    // (HashPartitioner and numPartitions stand in for the slide's "hashFunction")
    val partitionedLinks =
      links.partitionBy(new HashPartitioner(numPartitions)).cache()

    for (i <- 1 to ITERATIONS) {
      val contribs = partitionedLinks.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)
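Because the links RDD is now hash-partitioned and cached, each join can bring the matching ranks partitions to the nodes that already hold the corresponding links partitions; the large neighbor lists are never shuffled again, and only the small rank values move between iterations.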

Page 28:

Conclusions

When applying any algorithm to big data, watch for:

1. Correctness

2. Performance

• Cache RDDs to avoid I/O

• Avoid unnecessary computation

3. Trade-off between accuracy and performance


Page 29:

Numerical Computing with Spark