Introduction to Recommender Systems€¦ · Web Personalization & Recommender Systems I Most of...

Introduction to Recommender Systems

Fabio Petroni

About me

Fabio Petroni

Sapienza University of Rome, ItalyCurrent position:

PhD Student in Engineering in Computer ScienceResearch Interests:

data mining, machine learning, big [email protected]

I slides available athttp://www.fabiopetroni.com/teaching

2 of 65

[email protected]

http://www.fabiopetroni.com/teaching

Materials

IXavier Amatriain Lecture at Machine Learning Summer

School 2014, Carnegie Mellon UniversityB

https://youtu.be/bLhq63ygoU8

Bhttps://youtu.be/mRToFXlNBpQ

IRecommender Systems course by Rahul Sami at Michigan’sOpen UniversityB

http://open.umich.edu/education/si/si583/winter2009

IData Mining and Matrices Course by Rainer Gemulla atUniversity of MannheimB

http://dws.informatik.uni-mannheim.de/en/teaching/courses-

for-master-candidates/ie-673-data-mining-and-matrices/

3 of 65

Age of discovery

Xavier Amatriain – July 2014 – Recommender Systems

The Age of Search has come to an end

• ... long live the Age of Recommendation!• Chris Anderson in “The Long Tail”

• “We are leaving the age of information and entering the age of recommendation”

• CNN Money, “The race to create a 'smart' Google”:• “The Web, they say, is leaving the era of search and

entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.”

4 of 65

Web Personalization & Recommender Systems

I Most of todays internet businesses deeply root their successin the ability to provide users with strongly personalizedexperiences.

I Recommender Systems are a particular type of personalizedWeb-based applications that provide to users personalizedrecommendations about content they may be interested in.

5 of 65

Example 1

6 of 65

Example 2

Example: Amazon Recommendations

http://www.amazon.com/

7 of 65

Example 3

8 of 65

The tyranny of choice


Information overload

“People read around 10 MB worth of material a day, hear 400 MB a day, and see 1 MB of information every second” - The Economist, November 2006

In 2015, consumption will raise to 74 GB a day - UCSD Study 2014

9 of 65


The value of recommendations• Netflix: 2/3 of the movies watched are

recommended

• Google News: recommendations generate 38% more clickthrough

• Amazon: 35% sales from recommendations

• Choicestream: 28% of the people would buy more music if they found what they liked.

u

10 of 65

Recommendation process

users

items

feedback

11 of 65

Input

Sources of information

• Explicit ratings on a numeric/ 5-star/3-star etc. scale

• Explicit binary ratings (thumbs up/thumbs down)

• Implicit information, e.g.,– who bookmarked/linked to the item?

– how many times was it viewed?

– how many units were sold?

– how long did users read the page?

• Item descriptions/features

• User profiles/preferences

12 of 65

Methods of a aggregating inputs

IContent-based filtering

Brecommendations based on item descriptions/features, and

profile or past behavior of the “target” user only.

ICollaborative filtering

Blook at the ratings of like-minded users to provide

recommendations, with the idea that users who have expressed

similar interests in the past will share common interests in the

future.

13 of 65

Collaborative Filtering

I Collaborative Filtering (CF) represents today’s a widelyadopted strategy to build recommendation engines.

I CF analyzes the known preferences of a group of users tomake predictions of the unknown preferences for other users.

14 of 65

Collaborative filtering

I problemB

set of users

Bset of items (movies, books, songs, ...)

Bfeedback

I explicit (ratings, ...)I implicit (purchase, click-through, ...)

I predict the preference of each user for each itemB

assumption: similar feedback $ similar taste

I example (explicit feedback):

Avatar The Matrix UpMarco 4 2Luca 3 2Anna 5 3

15 of 65

Collaborative filtering

I problemB

set of users

Bset of items (movies, books, songs, ...)

Bfeedback

I explicit (ratings, ...)I implicit (purchase, click-through, ...)

I predict the preference of each user for each itemB

assumption: similar feedback $ similar taste

I example (explicit feedback):

Avatar The Matrix UpMarco ? 4 2Luca 3 2 ?

Anna 5 ? 3

15 of 65

Collaborative filtering taxonomy

SVD PMFuser based PLS(A/I)

memory based

collaborative filtering

item based

model based

probabilistic methods

neighborhood models

dimensionality reduction

matrix completion

latent Dirichlet allocation

other machine learning methods

Bayesian networks

Markov decision processes

neural networks

IMemory-based use the ratings to compute similaritiesbetween users or items (the “memory" of the system) that aresuccessively exploited to produce recommendations.

IModel-based use the ratings to estimate or learn a modeland then apply this model to make rating predictions.

16 of 65

Memory based

neighborhood models

17 of 65


The CF Ingredients

● List of m Users and a list of n Items● Each user has a list of items with associated opinion

○ Explicit opinion - a rating score ○ Sometime the rating is implicitly – purchase records

or listen to tracks● Active user for whom the CF prediction task is

performed● Metric for measuring similarity between users● Method for selecting a subset of neighbors ● Method for predicting a rating for items not currently

rated by the active user.

18 of 65


Collaborative Filtering

The basic steps:

1. Identify set of ratings for the target/active user2. Identify set of users most similar to the target/active user

according to a similarity function (neighborhood formation)

3. Identify the products these similar users liked4. Generate a prediction - rating that would be given by the

target user to the product - for each one of these products 5. Based on this predicted rating recommend a set of top N

products

19 of 65


User-based Collaborative Filtering

20 of 65


User-User Collaborative Filtering

Target User

Weighted Sum

21 of 65


UB Collaborative Filtering ● A collection of user ui, i=1, …n and a collection

of products pj, j=1, …, m● An n × m matrix of ratings vij , with vij = ? if user

i did not rate product j● Prediction for user i and product j is computed

as

• Similarity can be computed by Pearson correlation

or

or

22 of 65

23 of 65

24 of 65

25 of 65

26 of 65

27 of 65


Item-based Collaborative Filtering

28 of 65


Item-Item Collaborative Filtering

29 of 65


Item Based CF Algorithm● Look into the items the target user has rated ● Compute how similar they are to the target item

○ Similarity only using past ratings from other users!

● Select k most similar items.● Compute Prediction by taking weighted average

on the target user’s ratings on the most similar items.

30 of 65


Item Similarity Computation

● Similarity between items i & j computed by finding users who have rated them and then applying a similarity function to their ratings.

● Cosine-based Similarity – items are vectors in the m dimensional user space (difference in rating scale between users is not taken into account).

31 of 65


Prediction Computation

● Generating the prediction – look into the target users ratings and use techniques to obtain predictions.

● Weighted Sum – how the active user rates the similar items.

32 of 65


Item-based CF Example

33 of 65



34 of 65



35 of 65



36 of 65



37 of 65



38 of 65


Performance Implications● Bottleneck - Similarity computation.● Time complexity, highly time consuming with

millions of users and items in the database.○ Isolate the neighborhood generation and

predication steps.○ “off-line component” / “model” – similarity

computation, done earlier & stored in memory.○ “on-line component” – prediction generation

process.

39 of 65


Challenges Of User-based CF Algorithms

● Sparsity – evaluation of large item sets, users purchases are under 1%.

● Difficult to make predictions based on nearest neighbor algorithms =>Accuracy of recommendation may be poor.

● Scalability - Nearest neighbor require computation that grows with both the number of users and the number of items.

● Poor relationship among like minded but sparse-rating users.

● Solution : usage of latent models to capture similarity between users & items in a reduced dimensional space.

40 of 65

Model based

dimensionality reduction

41 of 65


What we were interested in:■ High quality recommendations

Proxy question:■ Accuracy in predicted rating ■ Improve by 10% = $1million!

42 of 65

43 of 65


SVD/MF

X[n x m] = U[n x r] S [ r x r] (V[m x r])T

● X: m x n matrix (e.g., m users, n videos)● U: m x r matrix (m users, r factors)● S: r x r diagonal matrix (strength of each ‘factor’) (r: rank of the

matrix)● V: r x n matrix (n videos, r factor)

44 of 65

Recap: Singular Value Decomposition

• SVD is useful in data analysis� Noise removal, visualization, dimensionality reduction, . . .

• Provides a mean to understand the hidden structure in the data

We may think of Ak and its factor matrices as a low-rank modelof the data:

• Used to capture the important aspects of the data(cf. principal components)

• Ignores the rest

• Truncated SVD is best low-rank factorization of the data interms of Frobenius norm

• Truncated SVD Ak = Uk�kV Tk of A thus satisfies

�A � Ak�F = minrank(B)=k

�A � B�F

2 / 4545 of 65

SVD problems

I complete input matrix: all entries available and considered

I large portion of missing values

I heuristics to pre-fill missing values

Bitem’s average rating

Bmissing values as zeros

46 of 65

Matrix completion

IMatrix completion techniques avoid the necessity ofpre-filling missing entries by reasoning only on the observedratings.

I They can be seen as an estimate or an approximation of theSVD, computed using application specific optimizationcriteria.

I Such solutions are currently considered as the bestsingle-model approach to collaborative filtering, asdemonstrated, for instance, by the Netflix prize.

47 of 65

Matrix completion for collaborative filtering

I the completion is driven by a factorization

R P Q

I associate a latent factor vector with each user and each item

I missing entries are estimated through the dot product

rij ⇡ piqj

48 of 65

Latent factor models (Koren et al., 2009)

49 of 65

Latent factor models

� Discover latent factors (r = 1)

Avatar The Matrix Up

(2.24) (1.92) (1.18)

Anni

?

4 2

(1.98) (4.4) (3.8) (2.3)

Bob 3 2

?(1.21) (2.7) (2.3) (1.4)

Charlie 5

?

3

(2.30) (5.2) (4.4) (2.7)

� Minimum loss� Bias

, regularization, time, . . .

6 / 4250 of 65


� Discover latent factors (r = 1)

Avatar The Matrix Up(2.24) (1.92) (1.18)

Anni

?

4 2(1.98)

(4.4) (3.8) (2.3)

Bob 3 2

?

(1.21)

(2.7) (2.3) (1.4)

Charlie 5

?

3(2.30)

(5.2) (4.4) (2.7)

� Minimum loss� Bias


6 / 4251 of 65


� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)

Anni

?

4 2(1.98)

(4.4)

(3.8) (2.3)

Bob 3 2

?

(1.21) (2.7) (2.3)

(1.4)

Charlie 5

?

3(2.30) (5.2)

(4.4)

(2.7)

� Minimum loss

minQ,P

�

(i ,j)��

(vij � [QTP]ij)2

� Bias


6 / 4252 of 65



Anni ? 4 2(1.98) (4.4) (3.8) (2.3)

Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)

Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)

� Minimum loss

minQ,P

�

(i ,j)��

(vij � [QTP]ij)2

� Bias


6 / 4253 of 65



Anni ? 4 2(1.98) (4.4) (3.8) (2.3)

Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)

Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)

� Minimum loss

minQ,P,u,m

�

(i ,j)��

(vij � µ � ui � mj � [QTP]ij)2

� Bias


6 / 4254 of 65



Anni ? 4 2(1.98) (4.4) (3.8) (2.3)

Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)

Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)

� Minimum loss

minQ,P,u,m

�

(i ,j)��

(vij � µ � ui � mj � [QTP]ij)2

+ � (�Q� + �P� + �u� + �m�)

� Bias, regularization

, time, . . .

6 / 4255 of 65



Anni ? 4 2(1.98) (4.4) (3.8) (2.3)

Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)

Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)

� Minimum loss

minQ,P,u,m

�

(i ,j ,t)��t

(vij � µ � ui (t) � mj(t) � [QT (t)P]ij)2

+ � (�Q(t)� + �P� + �u(t)� + �m(t)�)

� Bias, regularization, time, . . .6 / 4256 of 65

Example: Netflix prize data

Root mean square error of predictionsCOVER FE ATURE

COMPUTER 48

M atrix factoriza-tion techniques have become a dominant meth-odology within

collaborative filtering recom-menders. Experience with datasets such as the Netflix Prize data has shown that they deliver accuracy superior to classical nearest-neighbor techniques. At the same time, they offer a com-pact memory-efficient model that systems can learn relatively easily. What makes these tech-niques even more convenient is that models can integrate natu-rally many crucial aspects of the data, such as multiple forms of feedback, temporal dynamics, and confidence levels.

References 1. D. Goldberg et al., “Using Col- laborative Filtering to Weave an Information Tapestry,” Comm. ACM, vol. 35, 1992, pp. 61-70.

2. B.M. Sarwar et al., “Application of Dimensionality Reduc-tion in Recommender System—A Case Study,” Proc. KDD Workshop on Web Mining for e-Commerce: Challenges and Opportunities (WebKDD), ACM Press, 2000.

3. S. Funk, “Netflix Update: Try This at Home,” Dec. 2006; http://sifter.org/~simon/journal/20061211.html.

4. Y. Koren, “Factorization Meets the Neighborhood: A Mul-tifaceted Collaborative Filtering Model,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM Press, 2008, pp. 426-434.

5. A. Paterek, “Improving Regularized Singular Value De-composition for Collaborative Filtering,” Proc. KDD Cup and Workshop, ACM Press, 2007, pp. 39-42.

6. G. Takács et al., “Major Components of the Gravity Recom-mendation System,” SIGKDD Explorations, vol. 9, 2007, pp. 80-84.

7. R. Salakhutdinov and A. Mnih, “Probabilistic Matrix Fac-torization,” Proc. Advances in Neural Information Processing Systems 20 (NIPS 07), ACM Press, 2008, pp. 1257-1264.

8. R. Bell and Y. Koren, “Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights,” Proc. IEEE Int’l Conf. Data Mining (ICDM 07), IEEE CS Press, 2007, pp. 43-52.

9. Y. Zhou et al., “Large-Scale Parallel Collaborative Filter-ing for the Netflix Prize,” Proc. 4th Int’l Conf. Algorithmic Aspects in Information and Management, LNCS 5034, Springer, 2008, pp. 337-348.

10. Y.F. Hu, Y. Koren, and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets,” Proc. IEEE Int’l Conf. Data Mining (ICDM 08), IEEE CS Press, 2008, pp. 263-272.

the mainstream crowd-pleasers, is The Sound of Music. And smack in the middle, appealing to all types, is The Wizard of Oz.

In this plot, some movies neighboring one another typi-cally would not be put together. For example, Annie Hall and Citizen Kane are next to each other. Although they are stylistically very different, they have a lot in common as highly regarded classic movies by famous directors. Indeed, the third dimension in the factorization does end up separating these two.

We tried many different implementations and pa-rameterizations for factorization. Figure 4 shows how different models and numbers of parameters affect the RMSE as well as the performance of the factorization’s evolving implementations—plain factorization, adding biases, enhancing user profile with implicit feedback, and two variants adding temporal components. The accuracy of each of the factor models improves by increasing the number of involved parameters, which is equivalent to increasing the factor model’s dimensionality, denoted by numbers on the charts.

The more complex factor models, whose descriptions involve more distinct sets of parameters, are more accu-rate. In fact, the temporal components are particularly important to model as there are significant temporal ef-fects in the data.

4060

90128

18050100 200

50

100200

100 200 500 50100 200 500 1,000

1,500

0.875

0.88

0.885

0.89

0.895

0.9

0.905

0.91

10 100 1,000 10,000 100,000Millions of parameters

RMSE

PlainWith biasesWith implicit feedbackWith temporal dynamics (v.1)With temporal dynamics (v.2)

Figure 4. Matrix factorization models’ accuracy. The plots show the root-mean-square error of each of four individual factor models (lower is better). Accuracy improves when the factor model’s dimensionality (denoted by numbers on the charts) increases. In addition, the more refined factor models, whose descriptions involve more distinct sets of parameters, are more accurate. For comparison, the Netflix system achieves RMSE = 0.9514 on the same dataset, while the grand prize’s required accuracy is RMSE = 0.8563.

17 / 45Koren et al., 2009.57 of 65

Another matrix

7 / 4258 of 65

Matrix reconstruction (unregularized)

8 / 4259 of 65


8 / 4260 of 65


8 / 4261 of 65


8 / 4262 of 65

Stochastic gradient descent

I parameters ⇥ = {P ,Q}I find minimum ⇥⇤ of loss

function L

I pick a starting point ⇥0

I iteratively update currentestimations for ⇥

6

7

0 5 10 15 20 25 30

loss

(× 1

07)

iterations

⇥n+1 ⇥n � ⌘@L

@⇥I learning rate ⌘

I an update for each given training point

63 of 65

Stochastic updates

Lij(P ,Q) = (rij � piqj)2

I SGD to minimize the squared loss iteratively computes:

pi pi � ⌘@Lij(P ,Q)

@pi= pi + ⌘("ij · qj)

qj qj � ⌘@Lij(P ,Q)

@qj= qj + ⌘("ij · pi)

I where "ij = rij � piqj

64 of 65

Suggested reading

I G. Linden, B. Smith, and J. York. Amazon.com recommendations:Item-to-item collaborative filtering. Internet Computing, IEEE,7(1):76–80, 2003.

I Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques

for recommender systems. Computer, 42(8):30–37, 2009.I X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering

techniques. Advances in Artificial Intelligence, 2009:4, 2009.I F. Ricci, L. Rokach, and B. Shapira. Introduction to recommender

systems handbook. Springer, 2011.I M. D. Ekstrand, J. T. Riedl, and J. A. Konstan. Collaborative filtering

recommender systems. Foundations and Trends in Human-ComputerInteraction, 4(2):81–173, 2011.

I J. A. Konstan and J. Riedl. Recommender systems: from algorithms touser experience. User Modeling and User-Adapted Interaction,22(1-2):101–123, 2012.

65 of 65

Introduction to Recommender Systems€¦ · Web Personalization & Recommender Systems I Most of...

Documents

Transcript of Introduction to Recommender Systems€¦ · Web Personalization & Recommender Systems I Most of...