Introduction to Recommender Systems€¦ · Web Personalization & Recommender Systems I Most of...
Transcript of Introduction to Recommender Systems€¦ · Web Personalization & Recommender Systems I Most of...
Introduction to Recommender Systems
Fabio Petroni
About me
Fabio Petroni
Sapienza University of Rome, ItalyCurrent position:
PhD Student in Engineering in Computer ScienceResearch Interests:
data mining, machine learning, big [email protected]
I slides available athttp://www.fabiopetroni.com/teaching
2 of 65
Materials
IXavier Amatriain Lecture at Machine Learning Summer
School 2014, Carnegie Mellon UniversityB
https://youtu.be/bLhq63ygoU8
Bhttps://youtu.be/mRToFXlNBpQ
IRecommender Systems course by Rahul Sami at Michigan’sOpen UniversityB
http://open.umich.edu/education/si/si583/winter2009
IData Mining and Matrices Course by Rainer Gemulla atUniversity of MannheimB
http://dws.informatik.uni-mannheim.de/en/teaching/courses-
for-master-candidates/ie-673-data-mining-and-matrices/
3 of 65
Age of discovery
Xavier Amatriain – July 2014 – Recommender Systems
The Age of Search has come to an end
• ... long live the Age of Recommendation!• Chris Anderson in “The Long Tail”
• “We are leaving the age of information and entering the age of recommendation”
• CNN Money, “The race to create a 'smart' Google”:• “The Web, they say, is leaving the era of search and
entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.”
4 of 65
Web Personalization & Recommender Systems
I Most of todays internet businesses deeply root their successin the ability to provide users with strongly personalizedexperiences.
I Recommender Systems are a particular type of personalizedWeb-based applications that provide to users personalizedrecommendations about content they may be interested in.
5 of 65
Example 1
6 of 65
Example 2
Example: Amazon Recommendations
http://www.amazon.com/
7 of 65
Example 3
8 of 65
The tyranny of choice
Xavier Amatriain – July 2014 – Recommender Systems
Information overload
“People read around 10 MB worth of material a day, hear 400 MB a day, and see 1 MB of information every second” - The Economist, November 2006
In 2015, consumption will raise to 74 GB a day - UCSD Study 2014
9 of 65
Xavier Amatriain – July 2014 – Recommender Systems
The value of recommendations• Netflix: 2/3 of the movies watched are
recommended
• Google News: recommendations generate 38% more clickthrough
• Amazon: 35% sales from recommendations
• Choicestream: 28% of the people would buy more music if they found what they liked.
u
10 of 65
Recommendation process
users
items
feedback
11 of 65
Input
Sources of information
• Explicit ratings on a numeric/ 5-star/3-star etc. scale
• Explicit binary ratings (thumbs up/thumbs down)
• Implicit information, e.g.,– who bookmarked/linked to the item?
– how many times was it viewed?
– how many units were sold?
– how long did users read the page?
• Item descriptions/features
• User profiles/preferences
12 of 65
Methods of a aggregating inputs
IContent-based filtering
Brecommendations based on item descriptions/features, and
profile or past behavior of the “target” user only.
ICollaborative filtering
Blook at the ratings of like-minded users to provide
recommendations, with the idea that users who have expressed
similar interests in the past will share common interests in the
future.
13 of 65
Collaborative Filtering
I Collaborative Filtering (CF) represents today’s a widelyadopted strategy to build recommendation engines.
I CF analyzes the known preferences of a group of users tomake predictions of the unknown preferences for other users.
14 of 65
Collaborative filtering
I problemB
set of users
Bset of items (movies, books, songs, ...)
Bfeedback
I explicit (ratings, ...)I implicit (purchase, click-through, ...)
I predict the preference of each user for each itemB
assumption: similar feedback $ similar taste
I example (explicit feedback):
Avatar The Matrix UpMarco 4 2Luca 3 2Anna 5 3
15 of 65
Collaborative filtering
I problemB
set of users
Bset of items (movies, books, songs, ...)
Bfeedback
I explicit (ratings, ...)I implicit (purchase, click-through, ...)
I predict the preference of each user for each itemB
assumption: similar feedback $ similar taste
I example (explicit feedback):
Avatar The Matrix UpMarco ? 4 2Luca 3 2 ?
Anna 5 ? 3
15 of 65
Collaborative filtering taxonomy
SVD PMFuser based PLS(A/I)
memory based
collaborative filtering
item based
model based
probabilistic methods
neighborhood models
dimensionality reduction
matrix completion
latent Dirichlet allocation
other machine learning methods
Bayesian networks
Markov decision processes
neural networks
IMemory-based use the ratings to compute similaritiesbetween users or items (the “memory" of the system) that aresuccessively exploited to produce recommendations.
IModel-based use the ratings to estimate or learn a modeland then apply this model to make rating predictions.
16 of 65
Memory based
neighborhood models
17 of 65
Xavier Amatriain – July 2014 – Recommender Systems
The CF Ingredients
● List of m Users and a list of n Items● Each user has a list of items with associated opinion
○ Explicit opinion - a rating score ○ Sometime the rating is implicitly – purchase records
or listen to tracks● Active user for whom the CF prediction task is
performed● Metric for measuring similarity between users● Method for selecting a subset of neighbors ● Method for predicting a rating for items not currently
rated by the active user.
18 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Collaborative Filtering
The basic steps:
1. Identify set of ratings for the target/active user2. Identify set of users most similar to the target/active user
according to a similarity function (neighborhood formation)
3. Identify the products these similar users liked4. Generate a prediction - rating that would be given by the
target user to the product - for each one of these products 5. Based on this predicted rating recommend a set of top N
products
19 of 65
Xavier Amatriain – July 2014 – Recommender Systems
User-based Collaborative Filtering
20 of 65
Xavier Amatriain – July 2014 – Recommender Systems
User-User Collaborative Filtering
Target User
Weighted Sum
21 of 65
Xavier Amatriain – July 2014 – Recommender Systems
UB Collaborative Filtering ● A collection of user ui, i=1, …n and a collection
of products pj, j=1, …, m● An n × m matrix of ratings vij , with vij = ? if user
i did not rate product j● Prediction for user i and product j is computed
as
• Similarity can be computed by Pearson correlation
or
or
22 of 65
23 of 65
24 of 65
25 of 65
26 of 65
27 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based Collaborative Filtering
28 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-Item Collaborative Filtering
29 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item Based CF Algorithm● Look into the items the target user has rated ● Compute how similar they are to the target item
○ Similarity only using past ratings from other users!
● Select k most similar items.● Compute Prediction by taking weighted average
on the target user’s ratings on the most similar items.
30 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item Similarity Computation
● Similarity between items i & j computed by finding users who have rated them and then applying a similarity function to their ratings.
● Cosine-based Similarity – items are vectors in the m dimensional user space (difference in rating scale between users is not taken into account).
31 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Prediction Computation
● Generating the prediction – look into the target users ratings and use techniques to obtain predictions.
● Weighted Sum – how the active user rates the similar items.
32 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
33 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
34 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
35 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
36 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
37 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Item-based CF Example
38 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Performance Implications● Bottleneck - Similarity computation.● Time complexity, highly time consuming with
millions of users and items in the database.○ Isolate the neighborhood generation and
predication steps.○ “off-line component” / “model” – similarity
computation, done earlier & stored in memory.○ “on-line component” – prediction generation
process.
39 of 65
Xavier Amatriain – July 2014 – Recommender Systems
Challenges Of User-based CF Algorithms
● Sparsity – evaluation of large item sets, users purchases are under 1%.
● Difficult to make predictions based on nearest neighbor algorithms =>Accuracy of recommendation may be poor.
● Scalability - Nearest neighbor require computation that grows with both the number of users and the number of items.
● Poor relationship among like minded but sparse-rating users.
● Solution : usage of latent models to capture similarity between users & items in a reduced dimensional space.
40 of 65
Model based
dimensionality reduction
41 of 65
Xavier Amatriain – July 2014 – Recommender Systems
What we were interested in:■ High quality recommendations
Proxy question:■ Accuracy in predicted rating ■ Improve by 10% = $1million!
42 of 65
43 of 65
Xavier Amatriain – July 2014 – Recommender Systems
SVD/MF
X[n x m] = U[n x r] S [ r x r] (V[m x r])T
● X: m x n matrix (e.g., m users, n videos)● U: m x r matrix (m users, r factors)● S: r x r diagonal matrix (strength of each ‘factor’) (r: rank of the
matrix)● V: r x n matrix (n videos, r factor)
44 of 65
Recap: Singular Value Decomposition
• SVD is useful in data analysis� Noise removal, visualization, dimensionality reduction, . . .
• Provides a mean to understand the hidden structure in the data
We may think of Ak and its factor matrices as a low-rank modelof the data:
• Used to capture the important aspects of the data(cf. principal components)
• Ignores the rest
• Truncated SVD is best low-rank factorization of the data interms of Frobenius norm
• Truncated SVD Ak = Uk�kV Tk of A thus satisfies
�A � Ak�F = minrank(B)=k
�A � B�F
2 / 4545 of 65
SVD problems
I complete input matrix: all entries available and considered
I large portion of missing values
I heuristics to pre-fill missing values
Bitem’s average rating
Bmissing values as zeros
46 of 65
Matrix completion
IMatrix completion techniques avoid the necessity ofpre-filling missing entries by reasoning only on the observedratings.
I They can be seen as an estimate or an approximation of theSVD, computed using application specific optimizationcriteria.
I Such solutions are currently considered as the bestsingle-model approach to collaborative filtering, asdemonstrated, for instance, by the Netflix prize.
47 of 65
Matrix completion for collaborative filtering
I the completion is driven by a factorization
R P Q
I associate a latent factor vector with each user and each item
I missing entries are estimated through the dot product
rij ⇡ piqj
48 of 65
Latent factor models (Koren et al., 2009)
49 of 65
Latent factor models
� Discover latent factors (r = 1)
Avatar The Matrix Up
(2.24) (1.92) (1.18)
Anni
?
4 2
(1.98) (4.4) (3.8) (2.3)
Bob 3 2
?(1.21) (2.7) (2.3) (1.4)
Charlie 5
?
3
(2.30) (5.2) (4.4) (2.7)
� Minimum loss� Bias
, regularization, time, . . .
6 / 4250 of 65
Latent factor models
� Discover latent factors (r = 1)
Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni
?
4 2(1.98)
(4.4) (3.8) (2.3)
Bob 3 2
?
(1.21)
(2.7) (2.3) (1.4)
Charlie 5
?
3(2.30)
(5.2) (4.4) (2.7)
� Minimum loss� Bias
, regularization, time, . . .
6 / 4251 of 65
Latent factor models
� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni
?
4 2(1.98)
(4.4)
(3.8) (2.3)
Bob 3 2
?
(1.21) (2.7) (2.3)
(1.4)
Charlie 5
?
3(2.30) (5.2)
(4.4)
(2.7)
� Minimum loss
minQ,P
�
(i ,j)��
(vij � [QTP]ij)2
� Bias
, regularization, time, . . .
6 / 4252 of 65
Latent factor models
� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni ? 4 2(1.98) (4.4) (3.8) (2.3)
Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)
Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)
� Minimum loss
minQ,P
�
(i ,j)��
(vij � [QTP]ij)2
� Bias
, regularization, time, . . .
6 / 4253 of 65
Latent factor models
� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni ? 4 2(1.98) (4.4) (3.8) (2.3)
Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)
Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)
� Minimum loss
minQ,P,u,m
�
(i ,j)��
(vij � µ � ui � mj � [QTP]ij)2
� Bias
, regularization, time, . . .
6 / 4254 of 65
Latent factor models
� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni ? 4 2(1.98) (4.4) (3.8) (2.3)
Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)
Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)
� Minimum loss
minQ,P,u,m
�
(i ,j)��
(vij � µ � ui � mj � [QTP]ij)2
+ � (�Q� + �P� + �u� + �m�)
� Bias, regularization
, time, . . .
6 / 4255 of 65
Latent factor models
� Discover latent factors (r = 1)Avatar The Matrix Up(2.24) (1.92) (1.18)
Anni ? 4 2(1.98) (4.4) (3.8) (2.3)
Bob 3 2 ?(1.21) (2.7) (2.3) (1.4)
Charlie 5 ? 3(2.30) (5.2) (4.4) (2.7)
� Minimum loss
minQ,P,u,m
�
(i ,j ,t)��t
(vij � µ � ui (t) � mj(t) � [QT (t)P]ij)2
+ � (�Q(t)� + �P� + �u(t)� + �m(t)�)
� Bias, regularization, time, . . .6 / 4256 of 65
Example: Netflix prize data
Root mean square error of predictionsCOVER FE ATURE
COMPUTER 48
M atrix factoriza-tion techniques have become a dominant meth-odology within
collaborative filtering recom-menders. Experience with datasets such as the Netflix Prize data has shown that they deliver accuracy superior to classical nearest-neighbor techniques. At the same time, they offer a com-pact memory-efficient model that systems can learn relatively easily. What makes these tech-niques even more convenient is that models can integrate natu-rally many crucial aspects of the data, such as multiple forms of feedback, temporal dynamics, and confidence levels.
References 1. D. Goldberg et al., “Using Col- laborative Filtering to Weave an Information Tapestry,” Comm. ACM, vol. 35, 1992, pp. 61-70.
2. B.M. Sarwar et al., “Application of Dimensionality Reduc-tion in Recommender System—A Case Study,” Proc. KDD Workshop on Web Mining for e-Commerce: Challenges and Opportunities (WebKDD), ACM Press, 2000.
3. S. Funk, “Netflix Update: Try This at Home,” Dec. 2006; http://sifter.org/~simon/journal/20061211.html.
4. Y. Koren, “Factorization Meets the Neighborhood: A Mul-tifaceted Collaborative Filtering Model,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM Press, 2008, pp. 426-434.
5. A. Paterek, “Improving Regularized Singular Value De-composition for Collaborative Filtering,” Proc. KDD Cup and Workshop, ACM Press, 2007, pp. 39-42.
6. G. Takács et al., “Major Components of the Gravity Recom-mendation System,” SIGKDD Explorations, vol. 9, 2007, pp. 80-84.
7. R. Salakhutdinov and A. Mnih, “Probabilistic Matrix Fac-torization,” Proc. Advances in Neural Information Processing Systems 20 (NIPS 07), ACM Press, 2008, pp. 1257-1264.
8. R. Bell and Y. Koren, “Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights,” Proc. IEEE Int’l Conf. Data Mining (ICDM 07), IEEE CS Press, 2007, pp. 43-52.
9. Y. Zhou et al., “Large-Scale Parallel Collaborative Filter-ing for the Netflix Prize,” Proc. 4th Int’l Conf. Algorithmic Aspects in Information and Management, LNCS 5034, Springer, 2008, pp. 337-348.
10. Y.F. Hu, Y. Koren, and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets,” Proc. IEEE Int’l Conf. Data Mining (ICDM 08), IEEE CS Press, 2008, pp. 263-272.
the mainstream crowd-pleasers, is The Sound of Music. And smack in the middle, appealing to all types, is The Wizard of Oz.
In this plot, some movies neighboring one another typi-cally would not be put together. For example, Annie Hall and Citizen Kane are next to each other. Although they are stylistically very different, they have a lot in common as highly regarded classic movies by famous directors. Indeed, the third dimension in the factorization does end up separating these two.
We tried many different implementations and pa-rameterizations for factorization. Figure 4 shows how different models and numbers of parameters affect the RMSE as well as the performance of the factorization’s evolving implementations—plain factorization, adding biases, enhancing user profile with implicit feedback, and two variants adding temporal components. The accuracy of each of the factor models improves by increasing the number of involved parameters, which is equivalent to increasing the factor model’s dimensionality, denoted by numbers on the charts.
The more complex factor models, whose descriptions involve more distinct sets of parameters, are more accu-rate. In fact, the temporal components are particularly important to model as there are significant temporal ef-fects in the data.
4060
90128
18050100 200
50
100200
100 200 500 50100 200 500 1,000
1,500
0.875
0.88
0.885
0.89
0.895
0.9
0.905
0.91
10 100 1,000 10,000 100,000Millions of parameters
RMSE
PlainWith biasesWith implicit feedbackWith temporal dynamics (v.1)With temporal dynamics (v.2)
Figure 4. Matrix factorization models’ accuracy. The plots show the root-mean-square error of each of four individual factor models (lower is better). Accuracy improves when the factor model’s dimensionality (denoted by numbers on the charts) increases. In addition, the more refined factor models, whose descriptions involve more distinct sets of parameters, are more accurate. For comparison, the Netflix system achieves RMSE = 0.9514 on the same dataset, while the grand prize’s required accuracy is RMSE = 0.8563.
17 / 45Koren et al., 2009.57 of 65
Another matrix
7 / 4258 of 65
Matrix reconstruction (unregularized)
8 / 4259 of 65
Matrix reconstruction (unregularized)
8 / 4260 of 65
Matrix reconstruction (unregularized)
8 / 4261 of 65
Matrix reconstruction (unregularized)
8 / 4262 of 65
Stochastic gradient descent
I parameters ⇥ = {P ,Q}I find minimum ⇥⇤ of loss
function L
I pick a starting point ⇥0
I iteratively update currentestimations for ⇥
6
7
0 5 10 15 20 25 30
loss
(× 1
07)
iterations
⇥n+1 ⇥n � ⌘@L
@⇥I learning rate ⌘
I an update for each given training point
63 of 65
Stochastic updates
Lij(P ,Q) = (rij � piqj)2
I SGD to minimize the squared loss iteratively computes:
pi pi � ⌘@Lij(P ,Q)
@pi= pi + ⌘("ij · qj)
qj qj � ⌘@Lij(P ,Q)
@qj= qj + ⌘("ij · pi)
I where "ij = rij � piqj
64 of 65
Suggested reading
I G. Linden, B. Smith, and J. York. Amazon.com recommendations:Item-to-item collaborative filtering. Internet Computing, IEEE,7(1):76–80, 2003.
I Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques
for recommender systems. Computer, 42(8):30–37, 2009.I X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering
techniques. Advances in Artificial Intelligence, 2009:4, 2009.I F. Ricci, L. Rokach, and B. Shapira. Introduction to recommender
systems handbook. Springer, 2011.I M. D. Ekstrand, J. T. Riedl, and J. A. Konstan. Collaborative filtering
recommender systems. Foundations and Trends in Human-ComputerInteraction, 4(2):81–173, 2011.
I J. A. Konstan and J. Riedl. Recommender systems: from algorithms touser experience. User Modeling and User-Adapted Interaction,22(1-2):101–123, 2012.
65 of 65