Hadoop France meetup Feb 2016: Recommendations with Spark

Recommendations with Spark

Transcript of Hadoop France meetup Feb 2016: Recommendations with Spark

Page 1

Recommendations with Spark

Page 2

Hi! I’m Koby


▣ Data Scientist at Equancy

□ Previously: Kpler, Engie

▣ Python Dev

□ scikit-learn / pandas / Jupyter

□ Sometimes I use R

▣ I used Hadoop before for data pipelines

▣ My first project doing distributed ML!

Page 3

Hello, my name is Hervé!


▣ Equancy Partner & Chief Scientist

▣ In charge of Data Technologies

□ Data Engineering

□ Data Science

□ Innovating with data

▣ PhD in Machine Learning many years ago

Page 4

Page 5

Recommender Systems

Page 6

Recommenders: What for?


▣ Only one occasion to interact with customers

□ Which marketing message to choose?

▣ Personalized User Experience

□ Improved Experience!

▣ No information overload

□ ~230,000 Products

Page 7

Why does personalization matter? Because no personalization is ugly...


Page 8

Recommendation algorithms


Page 9

Three different recommendation systems

▣ Homepage: Collaborative Filtering (Unsupervised Learning)

▣ Product Page: Frequently Bought-Together Prediction (Supervised Learning)

▣ Cart: Content-Based Filtering (Correlation Maximization)

Page 10

Page 11

Business Rules

Page 12

Business Inputs

▣ The score should be based on three factors (a sketch follows this list):

□ Interaction type - a purchase is more important than a product view

□ Time (decay) - a product purchased recently will have more impact than one purchased in the distant past

□ Season - a product purchased during another season will have less impact
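For illustration, here is a minimal sketch of how these three factors could be combined into a single score. The weights, half-life, and names are assumptions made for the example, not the production values:

import math

# hypothetical interaction weights - a purchase counts more than a view
INTERACTION_WEIGHT = {"purchase": 5.0, "add_to_cart": 3.0, "view": 1.0}

def interaction_score(kind, days_ago, same_season, half_life_days=90.0):
    """Combine interaction type, time decay and season into one score."""
    decay = math.exp(-math.log(2) * days_ago / half_life_days)  # halves every 90 days
    season_factor = 1.0 if same_season else 0.5                 # off-season penalty
    return INTERACTION_WEIGHT[kind] * decay * season_factor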

Page 13

Business Rules

▣ The following items should be filtered out (a sketch follows this list):

□ Purchased recently, or very similar to a recent purchase

□ Not in the current season

□ Not matching the user’s gender

□ Not in stock
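A minimal sketch of such a post-filter, assuming each candidate carries the attributes the rules need (all field names are illustrative; the slides don't show the actual implementation):

def passes_business_rules(candidate, user, current_season):
    """Hypothetical filter applying the four rules above."""
    return (candidate["product_id"] not in user["recently_purchased"]
            and candidate["season"] == current_season
            and candidate["gender"] in (user["gender"], "unisex")
            and candidate["in_stock"])

# keep only the candidates that survive every rule:
# recommendations = [c for c in candidates if passes_business_rules(c, user, season)]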

Page 14

Collaborative Filtering

Page 15

[Figure: a sparse user × product matrix; cells hold interaction scores of 1, 3, or 5]

▣ Map users to products in a matrix

Page 16

[Figure: the same matrix with unobserved cells marked "?"; the goal is to fill them in]

▣ Predict missing interactions

Page 17

Training

Matrix Factorization

[Figure: the sparse Users × Items matrix is approximated as the product of a Users × Latent Factors matrix and a Latent Factors × Items matrix]

Page 18

Training

Matrix Factorization

[Figure: the sparse user × item interaction matrix used as training input]

▣ Input:

□ Sparse representation of the matrix as tuples (sketched below)

□ Representation of an interaction score between user and product
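As a sketch, this tuple input could look as follows in PySpark (made-up IDs; sc is the SparkContext, as in the code on the later slides):

from pyspark.mllib.recommendation import Rating

# one tuple per observed interaction - missing cells are simply absent
interactions = sc.parallelize([
    Rating(0, 1, 5.0),  # user 0, product 1, interaction score 5
    Rating(0, 2, 1.0),
    Rating(1, 2, 3.0),
])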

Page 19

Training

Matrix Factorization

▣ Output:

□ User Features: mapping users to latent features

□ Product Features: mapping products to latent features

□ Estimation of interaction scores

[Figure: the factorization output - a Users × Latent Factors matrix and a Latent Factors × Items matrix whose product estimates the full score matrix]
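In formula form (standard matrix factorization notation, not shown on the slide): with $k$ latent factors, the user and product feature matrices approximate the interaction matrix, and every missing cell gets an estimated score:

$$R \approx X Y^\top, \qquad X \in \mathbb{R}^{|\text{Users}| \times k}, \quad Y \in \mathbb{R}^{|\text{Items}| \times k}, \qquad \hat{r}_{ui} = x_u^\top y_i$$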

Page 20

Alternating Least Squares (ALS)
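For reference, the standard ALS formulation (not detailed on the slide): minimize the regularized squared error over the observed cells $\Omega$; fixing one factor matrix makes the problem linear in the other, so the algorithm alternates between two closed-form least-squares updates:

$$\min_{X,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

$$x_u \leftarrow \left( Y_u^\top Y_u + \lambda I \right)^{-1} Y_u^\top r_u, \qquad y_i \leftarrow \left( X_i^\top X_i + \lambda I \right)^{-1} X_i^\top r_i$$

where $Y_u$ stacks the factors of the items user $u$ interacted with, and $X_i$ those of the users who interacted with item $i$.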

Page 21

Implicit Collaborative Filtering

Page 22

Implicit Collaborative Filtering

▣ Difficulties:

□ How do we interpret missing relations between users and products? If a user didn’t click on an item, does it mean the user doesn’t like it? Maybe he just hasn’t seen it yet?

□ What values should we use for missing relations? Should we replace them with 0? With the mean/median?

▣ Methods for explicit feedback (i.e. product ratings) can’t be applied to our case!

Page 23

Implicit Collaborative Filtering

▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren & Volinsky)

▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application

(Google the title to read it for free on the authors’ page)
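For reference, the paper’s formulation: the raw feedback $r_{ui}$ is split into a binary preference $p_{ui}$ and a confidence $c_{ui}$, and the loss runs over all user-item pairs, not just the observed ones:

$$p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}, \qquad c_{ui} = 1 + \alpha\, r_{ui}$$

$$\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

The alpha parameter exposed by MLlib is exactly this confidence multiplier.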

Page 24

Implementation in Spark

Page 25

Training

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel

Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

ratings: RDD of (userID, productID, rating) pairs
rank: number of features to use
iterations: number of iterations of ALS (recommended: 10-20)
lambda: regularization factor (recommended: 0.01)
blocks: level of parallelism to split computation into
alpha: confidence parameter
seed: random seed
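A usage sketch in PySpark (the parameter values are illustrative; ratings is an RDD of Rating tuples as sketched on page 18, and note that the Python API spells the regularization parameter lambda_):

from pyspark.mllib.recommendation import ALS

# the ratings hold interaction strengths (e.g. counts), not explicit grades
model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                          lambda_=0.01, alpha=40.0)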

Page 26

ALS for Python

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

Page 27

ALS for Scala

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

Page 28

Validation and Parameter Tuning

Page 29

Measuring Prediction Performance

▣ In order to select good parameters for our model, we designed a validation benchmark

▣ We based it on a relatively small dataset so that we could run a significant number of tests

▣ We chose to measure and minimize the RMSE*:

□ used by default in ALS

□ punishes big errors

□ the error is on the scale of the rating unit

□ a common metric for CF

* RMSE - Root Mean Square Error
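For reference, over the $N$ held-out (user, item) pairs:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2 }$$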

Page 30

Measuring Prediction Performance

import math
from pyspark.mllib.recommendation import ALS

# splitting the dataset randomly into a train set and a validation set
training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))

ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0

# measuring the error on the validation set
min_error = float('inf')
best_rank = -1
for rank in ranks:
    model = ALS.train(training_RDD, rank, seed=0, iterations=10, lambda_=0.1)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors[err] = error
    err += 1
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)

For rank 4 the RMSE is 0.963681878574
For rank 8 the RMSE is 0.96250475933
For rank 12 the RMSE is 0.971647563632
The best model was trained with rank 8

Page 31

Deployment

Page 32

Deployment

▣ Training a model is actually pretty fast

▣ Deploying is slow

□ We decided that every user will get a top-n recommendation

□ This recommendation is stored in a DB

▣ We need to make a fresh recommendation for every user - and there are 4 million users. In Python:

def recommendProducts(self, user, num):
    """Recommends the top "num" number of products for a given user and returns a list
    of Rating objects sorted by the predicted rating in descending order."""
    return list(self.call("recommendProducts", user, num))

▣ This call takes around 20 ms - pretty quick

□ but calling this function 4M times ≈ 1 day (4,000,000 × 20 ms = 80,000 s ≈ 22 hours)

Page 33

Deployment

▣ I wasn’t the only one who needed this feature ...

Page 34

Deployment

▣ Solution: extract the User / Product features, then apply the matrix multiplication and the sorting directly on the RDD, in batches:

users_rdd = model.userFeatures()
products_rdd = model.productFeatures()

from joblib import Parallel, delayed
Parallel(n_jobs=cores, verbose=1000)(
    delayed(prepare_recommendation)(user_features_batch, gender)
    for user_features_batch in nested_user_features)
...
user_features_batch.dot(product_features_T)

This was about 10 times faster than calling recommendProducts
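For context, a minimal sketch of what the batched scoring step might look like once the features are collected as NumPy arrays. prepare_recommendation and nested_user_features are not shown on the slides, so every name and detail below is an assumption:

import numpy as np

def prepare_recommendation(user_features_batch, product_ids, product_features_T, n=10):
    """Hypothetical sketch: score one batch of users against all products
    and keep the n best products per user."""
    # (batch, rank) x (rank, n_products) -> (batch, n_products)
    scores = user_features_batch.dot(product_features_T)
    # column indices of the n highest scores, per row
    top_n = np.argsort(-scores, axis=1)[:, :n]
    # product_ids: np.array aligned with the columns of product_features_T
    return product_ids[top_n]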

▣ Starting from Spark 1.6, recommendProductsForUsers is implemented for Python

□ This is where Scala has an advantage over Python!
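Usage is then a one-liner (a sketch; 10 is an arbitrary top-n):

# distributed top-10 per user, as an RDD of (userID, list of Ratings)
top10_per_user = model.recommendProductsForUsers(10)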

Page 35

Discussing Collaborative Filtering

Page 36

Domain-specific discussion

▣ Pros

□ Helps us to find non-obvious relations between users and products

□ High diversity and coverage of item catalogue

□ Using an unsupervised method, we project onto a low-dimensional space:

Latent Factor 1 = 20% red boots + 30% green sneakers + …

Latent Factor 2 = 15% adidas sneakers + 35% comfy boots + ...

➔ Embodies “deep” preferences (fashion, style, ...)

▣ Cons

□ Unpredictable results:

e.g. the user never shopped for red boots - so why are they recommended?

□ Can be perceived as an intrusion into users’ privacy (a machine analysing “deep” human desires…)

Page 37

Machine Learning / Big Data discussion

▣ Pros

□ Training the model is quick thanks to the low dimensionality of the latent features

□ Linear model with a closed-form solution (“easy!”)

□ No cold-start problem (vs. User-based CF)

□ Training is parallelizable: Hadoop Friendly

▣ Cons

□ Computationally heavy compared to Content-Based approaches

□ Unable to fit non-linear relations (polynomial tricks can’t be applied)

Page 38

Guess what?


We’re hiring!

Data Engineers warmly welcomed

Page 39

QUESTIONS & ANSWERS

Page 40

Thank You!

www.equancy.com

47 rue de Chaillot - 75116 Paris

Koby Karp

Hervé Mignot

[email protected]@equancy.com