Co-occurrence Based Recommendations with Mahout, Scala and Spark

Sebastian Schelter (@sscdotopen), BigData Beers, 05/29/2014

Cooccurrence Analysis

History matrix

// real use case: load from DFS
// val A = drmFromHDFS(...)

// our toy example (rows: users, columns: items)
val A = drmParallelize(dense(
  (1, 1, 1, 0),  // Alice
  (1, 0, 1, 0),  // Bob
  (0, 0, 1, 1)), // Charles
  numPartitions = 2)

How often do items co-occur?

// compute co-occurrence matrix
val C = A.t %*% A
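To make the co-occurrence matrix concrete, the following is a minimal sketch in plain Scala collections (no Mahout or Spark involved) that computes $A^T A$ for the toy history matrix. The object and function names are made up for this illustration; entry (i, j) of the result counts the users who interacted with both item i and item j.

```scala
object ToyCooccurrence {
  // History matrix from the toy example: rows are users, columns are items.
  val A: Vector[Vector[Int]] = Vector(
    Vector(1, 1, 1, 0), // Alice
    Vector(1, 0, 1, 0), // Bob
    Vector(0, 0, 1, 1)  // Charles
  )

  // C = A^T A: entry (i, j) is the dot product of item columns i and j.
  def cooccurrences(a: Vector[Vector[Int]]): Vector[Vector[Int]] = {
    val numItems = a.head.length
    Vector.tabulate(numItems, numItems) { (i, j) =>
      a.map(row => row(i) * row(j)).sum
    }
  }

  def main(args: Array[String]): Unit =
    cooccurrences(A).foreach(row => println(row.mkString(" ")))
}
```

The diagonal holds the number of interactions per item, which is the same information as the column sums of A.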

Which co-occurrences are interesting?

// compute some statistics
val interactionsPerItem = drmBroadcast(A.colSums)

// convert to indicator matrix
val I = C.mapBlock() {
  // compute LLR scores from
  // cooccurrences and statistics
  ...
  // only keep interesting cooccurrences
  ...
}

// save indicator matrix
I.writeDrm(...)
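The slide elides how the LLR scores are computed. As a rough illustration only, the sketch below shows one common entropy-based formulation of the log-likelihood ratio for a 2x2 contingency table, written in plain Scala. It is not taken from the talk, and the names `LlrSketch`, `xLogX` and `entropy` are assumptions made for this example.

```scala
object LlrSketch {
  // x * ln(x), with the convention 0 * ln(0) = 0
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a list of counts
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11: users who interacted with both items
  // k12: users who interacted with the first item only
  // k21: users who interacted with the second item only
  // k22: users who interacted with neither item
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // clamp tiny negative values caused by floating-point round-off
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}
```

Under this formulation, the four counts for an item pair (i, j) could be derived from the co-occurrence count and the per-item interaction counts: k11 is the co-occurrence count, k12 and k21 are the per-item counts minus k11, and k22 is the number of users minus the other three. A score near zero means the items co-occur about as often as chance would predict.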

Cooccurrence Analysis prototype available

• MAHOUT-1464 provides a full-fledged cooccurrence analysis prototype
– applies selective downsampling to make computation tractable
– supports cross-recommendations in datasets with multiple interaction types, e.g.
  • “people who watch this video also watch those videos”
  • “people who enter this search query watch those videos”
– includes code to run this on the Movielens and Epinions datasets

• future plan: easy indexing of the indicator matrix with Apache Solr to allow for search-as-recommendation deployments
– prototype for MR code already exists at https://github.com/pferrel/solr-recommender
– integration is in the works

Under the covers

Underlying systems

• currently: runtime based on Apache Spark

– fast and expressive cluster computing system

– general computation graphs, in-memory primitives, rich API, interactive shell

• potentially supported in the future:
– Apache Flink (formerly: “Stratosphere”)
– H2O

Runtime & Optimization

• Execution is deferred; the user composes logical operators, e.g.
  val C = A.t %*% A

• Computational actions implicitly trigger optimization (= selection of physical plan) and execution, e.g.
  I.writeDrm(path)
  val inMemV = (U %*% M).collect

• Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths

• e.g. matrix multiplication:
– 5 physical operators for drmA %*% drmB
– 2 operators for drmA %*% inMemA
– 1 operator for drmA %*% x
– 1 operator for x %*% drmA

Optimization Example

• Computation of $A^T A$ in example: val C = A.t %*% A

• Naïve execution
– 1st pass: transpose A (requires repartitioning of A)
– 2nd pass: multiply result with A (expensive, potentially requires repartitioning again)

• Logical optimization
– optimizer rewrites plan to use specialized logical operator for Transpose-Times-Self matrix multiplication

[Figure: naïve plan A → Transpose, then (Transpose, A) → MatrixMult → C; optimized plan A → Transpose-Times-Self → C]
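The rewrite described on this slide can be sketched as pattern matching over a small tree of logical operators. The code below is an illustration in plain Scala; the case classes are hypothetical stand-ins named after the boxes in the slide's figure, not Mahout's actual operator classes. Building the tree touches no data, which mirrors the deferred execution mentioned earlier.

```scala
object PlanRewriteSketch {
  // Hypothetical logical operators, named after the slide's diagram
  sealed trait Plan
  final case class Matrix(name: String) extends Plan
  final case class Transpose(in: Plan) extends Plan
  final case class MatrixMult(left: Plan, right: Plan) extends Plan
  final case class TransposeTimesSelf(in: Plan) extends Plan

  // Replace Transpose(X) times X by the specialized operator, recursively
  def optimize(plan: Plan): Plan = plan match {
    case MatrixMult(Transpose(a), b) if a == b => TransposeTimesSelf(optimize(a))
    case MatrixMult(l, r)                      => MatrixMult(optimize(l), optimize(r))
    case Transpose(in)                         => Transpose(optimize(in))
    case TransposeTimesSelf(in)                => TransposeTimesSelf(optimize(in))
    case m: Matrix                             => m
  }

  def main(args: Array[String]): Unit = {
    val a = Matrix("A")
    println(optimize(MatrixMult(Transpose(a), a)))
  }
}
```

A product of two different matrices, such as `MatrixMult(Transpose(Matrix("A")), Matrix("B"))`, is left unchanged by this rule.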

Transpose-Times-Self

• Mahout computes $A^T A$ via row-outer-product formulation
– executes in a single pass over row-partitioned A

$$A^T A = \sum_{i=0}^{m} a_{i\bullet} a_{i\bullet}^T$$

[Figure: $A^T \times A = a_{1\bullet} \times a_{1\bullet}^T + a_{2\bullet} \times a_{2\bullet}^T + a_{3\bullet} \times a_{3\bullet}^T + a_{4\bullet} \times a_{4\bullet}^T$]
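The row-outer-product formulation can be tried out on local data. The sketch below, in plain Scala with illustrative names, folds over the rows of A once and accumulates the outer product of each row with itself; it only demonstrates the arithmetic and says nothing about how Mahout distributes the work.

```scala
object OuterProductAtA {
  type Row = Vector[Int]
  type Mat = Vector[Vector[Int]]

  // Outer product of a row with itself: an n x n matrix
  def outer(a: Row): Mat = a.map(ai => a.map(aj => ai * aj))

  // Element-wise sum of two matrices of equal shape
  def add(x: Mat, y: Mat): Mat =
    x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (p, q) => p + q } }

  // Single pass over the rows of A, accumulating the outer products
  def ata(rows: Iterator[Row], n: Int): Mat =
    rows.foldLeft(Vector.fill(n, n)(0)) { (acc, row) => add(acc, outer(row)) }

  def main(args: Array[String]): Unit = {
    val a = Vector(Vector(1, 1, 1, 0), Vector(1, 0, 1, 0), Vector(0, 0, 1, 1))
    ata(a.iterator, 4).foreach(r => println(r.mkString(" ")))
  }
}
```

Because each row contributes independently, the rows can be consumed in any order, which is what allows a single pass over a row-partitioned matrix.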

Physical operators for Transpose-Times-Self

• Two physical operators (concrete implementations) available for Transpose-Times-Self operation
– standard operator AtA
– operator AtA_slim, specialized implementation for tall & skinny matrices

• Optimizer must choose
– currently: depends on user-defined threshold for number of columns
– ideally: cost-based decision, dependent on estimates of intermediate result sizes

[Figure: logical plan A → Transpose-Times-Self → C]
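A threshold-based choice between the two physical operators might look like the sketch below. Everything here is an assumption made for illustration: the parameter name `maxSlimCols`, the inclusive comparison, and the object names are not taken from Mahout. The only grounded part is the idea that the slim variant targets matrices with few columns.

```scala
object OperatorChoiceSketch {
  sealed trait PhysicalOp
  case object AtA extends PhysicalOp
  case object AtASlim extends PhysicalOp // stands in for AtA_slim

  // maxSlimCols is a hypothetical user-defined threshold on the column count
  def choose(numCols: Int, maxSlimCols: Int): PhysicalOp =
    if (numCols <= maxSlimCols) AtASlim else AtA
}
```

A cost-based optimizer would replace the fixed threshold with an estimate of the intermediate result sizes, as the slide suggests.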

Physical operators for the distributed computation of $A^T A$

Physical operator AtA

[Figure: A is split by rows into A1 = (1 1 1 0; 1 0 1 0) on worker 1 and A2 = (0 0 1 1) on worker 2. Each worker computes partial results "for 1st partition" and "for 2nd partition" of the output. Worker 3 sums the partials for the 1st partition into rows (2 1 2 0; 1 1 1 0), worker 4 sums those for the 2nd partition into rows (2 1 3 1; 0 0 1 1); together these rows form $A^T A$.]
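The data flow in the AtA figure can be imitated with local collections. The sketch below is a simulation in plain Scala, not Mahout's implementation: each "worker" emits partial rows of the result keyed by row index, the partial rows are grouped by output partition, and each group is summed. The names and the `rowsPerPartition` parameter are assumptions for this example.

```scala
object AtAOperatorSketch {
  type Row = Vector[Int]

  // Map side: for every row of a block, emit one partial result row
  // per output row index i, scaled by the row's i-th entry
  def partials(block: Seq[Row]): Seq[(Int, Row)] =
    for {
      row <- block
      (value, i) <- row.zipWithIndex
    } yield (i, row.map(_ * value))

  // Reduce side: sum all partial rows that share a row index
  def sumByRow(emitted: Seq[(Int, Row)]): Map[Int, Row] =
    emitted.groupBy(_._1).map { case (i, ps) =>
      i -> ps.map(_._2).reduce((x, y) => x.zip(y).map { case (p, q) => p + q })
    }

  // Group partial rows by output partition, then sum within each partition
  def run(blocks: Seq[Seq[Row]], rowsPerPartition: Int): Map[Int, Map[Int, Row]] = {
    val emitted = blocks.flatMap(partials)
    emitted
      .groupBy { case (i, _) => i / rowsPerPartition }
      .map { case (p, ps) => p -> sumByRow(ps) }
  }
}
```

With the toy matrix split into the two blocks from the figure and two rows per output partition, partition 0 plays the role of worker 3 and partition 1 the role of worker 4.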

Physical operator AtA_slim

[Figure: A is split by rows into A1 = (1 1 1 0; 1 0 1 0) on worker 1 and A2 = (0 0 1 1) on worker 2. Worker 1 computes the upper triangle of $A_1^T A_1$ with rows (2 1 2 0), (1 1 0), (2 0), (0); worker 2 computes the upper triangle of $A_2^T A_2$ with rows (0 0 0 0), (0 0 0), (1 1), (1). The driver adds them: $C = A^T A = A_1^T A_1 + A_2^T A_2$, with rows (2 1 2 0; 1 1 1 0; 2 1 3 1; 0 0 1 1).]
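The AtA_slim figure suggests that every worker reduces its block to a small upper-triangular matrix and that the driver adds these up. The sketch below simulates that idea in plain Scala under these assumptions; the storage layout (row i holds columns i to n-1) and all names are illustrative, not Mahout's code.

```scala
object AtASlimSketch {
  type Row = Vector[Int]
  type Triangle = Vector[Vector[Int]] // row i holds columns i until n-1

  // Worker side: upper triangle of block^T * block
  def upperTriangle(block: Seq[Row], n: Int): Triangle =
    Vector.tabulate(n) { i =>
      Vector.tabulate(n - i) { k => block.map(r => r(i) * r(i + k)).sum }
    }

  // Driver side: element-wise sum of two triangles of equal shape
  def add(x: Triangle, y: Triangle): Triangle =
    x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (p, q) => p + q } }

  // Driver side: expand a triangle into the full symmetric matrix
  def toFull(tri: Triangle): Vector[Vector[Int]] = {
    val n = tri.length
    Vector.tabulate(n, n) { (i, j) =>
      if (j >= i) tri(i)(j - i) else tri(j)(i - j)
    }
  }

  def main(args: Array[String]): Unit = {
    val a1 = Seq(Vector(1, 1, 1, 0), Vector(1, 0, 1, 0))
    val a2 = Seq(Vector(0, 0, 1, 1))
    val c = toFull(add(upperTriangle(a1, 4), upperTriangle(a2, 4)))
    c.foreach(r => println(r.mkString(" ")))
  }
}
```

Storing only the upper triangle works because $A^T A$ is symmetric. Since each worker holds a full n x n triangle, this approach is practical only when the number of columns n is small, which matches the "tall & skinny" use case.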