SVD and the Netflix Dataset

SVD Applied to Collaborative Filtering ~ URUG 7-12-07 ~

description

Short summary and explanation of LSI (SVD) and how it can be applied to recommendation systems and the Netflix dataset in particular.

Transcript of SVD and the Netflix Dataset

Page 1: SVD and the Netflix Dataset

SVD Applied to Collaborative Filtering

~ URUG 7-12-07 ~

Page 4: SVD and the Netflix Dataset

Recommendation System

Very consumer driven.

Must provide good results or a user may not trust the system in the future.

Answers the question: What do I want next?!?

Page 5: SVD and the Netflix Dataset

Collaborative Filtering

Base user recommendations on:

User’s past history.

History of like-minded users.

View the data as a product x user matrix.

Find a “neighborhood” of similar users for that user.

Return the top-N recommendations.

Page 6: SVD and the Netflix Dataset

Early Approaches

Goldberg et al. (1992), Using Collaborative Filtering to Weave an Information Tapestry.

Konstan, J., et al. (1997), Applying Collaborative Filtering to Usenet News.

Use Pearson Correlation or cosine similarity as a measure of similarity to form neighborhoods.
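
A minimal sketch of how a Pearson-based neighborhood could be formed, assuming ratings are stored as one dictionary per user. The names `pearson`, `neighborhood`, and the toy `ratings` data are hypothetical and only illustrate the idea; they are not taken from the papers above.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the products both users have rated."""
    common = [p for p in a if p in b]
    if len(common) < 2:
        return 0.0  # too little overlap to measure any correlation
    mean_a = sum(a[p] for p in common) / len(common)
    mean_b = sum(b[p] for p in common) / len(common)
    num = sum((a[p] - mean_a) * (b[p] - mean_b) for p in common)
    den = (sqrt(sum((a[p] - mean_a) ** 2 for p in common))
           * sqrt(sum((b[p] - mean_b) ** 2 for p in common)))
    return num / den if den else 0.0

def neighborhood(ratings, user, size):
    """Return the `size` users most correlated with `user`."""
    scored = [(pearson(ratings[user], ratings[o]), o) for o in ratings if o != user]
    scored.sort(reverse=True)
    return [o for _, o in scored[:size]]

# Hypothetical toy data: user -> {product: rating}
ratings = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
    "dan": {"m4": 5},
}

print(neighborhood(ratings, "ann", 1))
```

Note that "dan" shares no products with "ann", so no correlation can be computed for that pair at all.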

Page 10: SVD and the Netflix Dataset

Early CF Challenges

Sparsity - With few overlapping ratings, no correlation can be found between many pairs of users, which reduces coverage.

Scalability - Nearest-neighbor computation time grows with the number of products and users.

Synonymy

Page 17: SVD and the Netflix Dataset

Dimensionality Reduction

Latent Semantic Indexing (LSI)

Algorithm from the IR community (late 80s to early 90s).

Addresses the problems of synonymy, polysemy, sparsity, and scalability for large datasets.

Reduces dimensionality of a dataset and captures the latent relationships.

Easily maps to CF!

Page 18: SVD and the Netflix Dataset

Framing LSI for CF

Products X Users matrix instead of Terms X Documents.

Netflix Dataset

480,189 users, 17,770 movies, only ~100 million ratings.

17,770 X 480,189 matrix that is roughly 99% sparse!

About 8.5 billion potential ratings.
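
The sparsity figure follows directly from the counts on this slide. A quick check of the arithmetic, assuming the rounded figure of 100 million ratings (the variable names are illustrative):

```python
users = 480_189
movies = 17_770
ratings = 100_000_000  # approximate count from the slide

cells = users * movies            # every (movie, user) pair is a potential rating
density = ratings / cells

print(f"{cells:,} potential ratings")
print(f"{density:.2%} filled, {1 - density:.2%} empty")
```

With these numbers only a little over 1% of the matrix is filled, which the slide rounds to 99% sparse.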

Page 19: SVD and the Netflix Dataset

SVD - The math behind LSI

Singular Value Decomposition

Any M x N matrix A of rank r can be decomposed as:

$A = U \Sigma V^T$

U is an M x M orthogonal matrix.

V is an N x N orthogonal matrix.

Σ is an M x N diagonal matrix whose first r diagonal entries are the nonzero singular values of A.

$\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > \sigma_{r+1} = \dots = \sigma_n = 0$

Page 20: SVD and the Netflix Dataset

Related to eigenvalue decomposition (PCA)

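The columns of U are orthonormal eigenvectors of AA^T. The first r of them span the column space; they are known as the left singular vectors.

The columns of V are orthonormal eigenvectors of A^TA. The first r of them span the row space; they are known as the right singular vectors.

The singular values are the square roots of the nonzero eigenvalues of A^TA (and of AA^T).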
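
The link to eigenvalue decomposition can be seen by substituting the decomposition into the two products. A short derivation, assuming $A = U \Sigma V^T$ with orthogonal U and V (so $U^T U = I$ and $V^T V = I$):

```latex
A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T

A A^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T
```

Both $\Sigma^T \Sigma$ and $\Sigma \Sigma^T$ are diagonal with entries $\sigma_i^2$, so each line is an eigenvalue decomposition whose eigenvalues are the squared singular values.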

Page 21: SVD and the Netflix Dataset

Reducing Dimensionality

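A_k is the closest rank-k approximation to A, built from the k largest singular values and their singular vectors:

$A_k = U_k \Sigma_k V_k^T$

A_k minimizes the Frobenius norm of the error over all rank-k matrices:

$\|A - A_k\|_F$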
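
A small sketch of the k = 1 case in plain Python. It uses power iteration on A^T A to find the top singular triplet, which is a technique swapped in here for illustration and not something the slides prescribe; the matrix `A` and the helper names are hypothetical.

```python
from math import sqrt

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (sigma_1, u_1, v_1)."""
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))          # (A^T A) v
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = matvec(a, v)
    sigma = sqrt(sum(x * x for x in av))      # sigma_1 = ||A v_1||
    u = [x / sigma for x in av]
    return sigma, u, v

def frobenius(m):
    return sqrt(sum(x * x for row in m for x in row))

# Hypothetical 3 products x 2 users ratings matrix.
A = [[5.0, 4.0],
     [4.0, 5.0],
     [1.0, 2.0]]

sigma1, u1, v1 = top_singular_triplet(A)
A1 = [[sigma1 * ui * vj for vj in v1] for ui in u1]   # rank-1 approximation
residual = [[x - y for x, y in zip(ra, r1)] for ra, r1 in zip(A, A1)]

# The squared error equals the sum of the discarded sigma_i^2,
# which is ||A||_F^2 - sigma_1^2 when k = 1.
print(frobenius(residual) ** 2, frobenius(A) ** 2 - sigma1 ** 2)
```

The two printed numbers should agree up to floating-point error, which is the sense in which A_k is the best rank-k fit.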

Page 22: SVD and the Netflix Dataset

Making Recommendations

Cosine similarity is a common way to find a neighborhood:

$\cos(i, j) = \frac{i \cdot j}{\|i\|_2 \, \|j\|_2}$

Recommendations are then based on that neighborhood and its users.

Predictions for a product can also be made with a simple dot product if the singular values are combined with the singular vectors:

$CP_{prod} = C_{avg} + U_k S_k^{1/2}(c) \cdot S_k^{1/2} V_k^T(p)$
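
A rough sketch of both ideas on made-up factors. It reads the prediction formula as follows, which is an assumption about the slide's notation: C_avg is the customer's average rating, (c) picks customer c's row of U_k, and (p) picks product p's row of V_k. All numbers and the names `cosine` and `predict` are hypothetical.

```python
from math import sqrt

def cosine(i, j):
    """cos(i, j) = (i . j) / (||i||_2 * ||j||_2)"""
    dot = sum(x * y for x, y in zip(i, j))
    return dot / (sqrt(sum(x * x for x in i)) * sqrt(sum(y * y for y in j)))

def predict(c_avg, u_row, s, v_row):
    """C_avg + (U_k S_k^(1/2))(c) . (S_k^(1/2) V_k^T)(p) for one customer and product."""
    left = [u * sqrt(sv) for u, sv in zip(u_row, s)]
    right = [sqrt(sv) * v for sv, v in zip(s, v_row)]
    return c_avg + sum(l * r for l, r in zip(left, right))

# Hypothetical k = 2 factors (made-up numbers, not from the Netflix data).
S_k = [4.0, 1.0]                                   # singular values
U_k = {"ann": [0.6, 0.2], "bob": [0.5, 0.3], "cat": [-0.4, 0.7]}
V_k = {"m1": [0.5, -0.1], "m2": [-0.2, 0.6]}

# Neighborhood search in the reduced space: ann is closer to bob than to cat.
print(cosine(U_k["ann"], U_k["bob"]), cosine(U_k["ann"], U_k["cat"]))

# Predicted rating of product m1 for ann, assuming her average rating is 3.5.
print(predict(3.5, U_k["ann"], S_k, V_k["m1"]))
```

Splitting the singular values as square roots on each side means the online step is a single k-dimensional dot product per prediction.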

Page 23: SVD and the Netflix Dataset

Challenges with SVD

Scalability - Once again, compute time grows with the number of users and products: O(m^3).

Offline stage.

Online stage.

Even computing the full SVD offline is impractical for large datasets, so other methods are needed.

Page 24: SVD and the Netflix Dataset

Incremental SVD

$u_k = u^T V_k \Sigma_k^{-1}$
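
A small sketch of this fold-in projection. It assumes the new user's ratings u form a row of a users x products matrix, so that V_k holds the product-side singular vectors; with the products x users orientation used earlier in the deck, the roles of U and V swap. The factors below are hand-built and hypothetical.

```python
from math import sqrt

def fold_in(u, v_k, sigma_k):
    """Project a ratings row u into the k-dim space: u_k = u^T V_k Sigma_k^{-1}."""
    return [sum(r * v for r, v in zip(u, col)) / s
            for col, s in zip(v_k, sigma_k)]

# Hypothetical rank-2 factors of a 3 users x 3 products matrix.
r = 1 / sqrt(2)
V_k = [[r, r, 0.0],          # each inner list is one column of V_k
       [0.0, 0.0, 1.0]]
sigma_k = [3.0, 1.0]
U_k = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]   # one row per existing user

def reconstruct_row(u_row):
    """Row of A = U_k Sigma_k V_k^T for one user."""
    return [sum(u_row[f] * sigma_k[f] * V_k[f][p] for f in range(len(sigma_k)))
            for p in range(len(V_k[0]))]

# Folding in an existing user's ratings recovers that user's row of U_k.
existing = reconstruct_row(U_k[0])
print(fold_in(existing, V_k, sigma_k))

# A new user is placed in the same space without recomputing the SVD.
new_user = [4.0, 5.0, 1.0]
print(fold_in(new_user, V_k, sigma_k))
```

The projection costs one small matrix-vector product per new user, which is why it suits the online stage.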

Page 25: SVD and the Netflix Dataset

Incremental SVD Results

Page 26: SVD and the Netflix Dataset

GHA for SVD

Gorrell (2006), GHA for Incremental SVD in NLP

Based on Sanger’s (1989) GHA for eigendecomposition.

$\Delta c_i^a = c_i^b \cdot b \left( a - \sum_{j<i} (a \cdot c_j^a) \, c_j^a \right)$

$\Delta c_i^b = c_i^a \cdot a \left( b - \sum_{j<i} (b \cdot c_j^b) \, c_j^b \right)$

Page 27: SVD and the Netflix Dataset

GHA extended by Funk

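    void train(int user, int movie, real rating)
    {
        /* Prediction error, scaled by the learning rate. */
        real err = lrate * (rating - predictRating(movie, user));

        /* Save the old user value so both updates use pre-update values. */
        real uv = userValue[user];
        userValue[user] += err * movieValue[movie];
        movieValue[movie] += err * uv;
    }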
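
To see the update rule run end to end, here is a Python transliteration that trains a single feature on toy data. It is only a sketch: the prediction is assumed to be the plain product of the two feature values, and the data, learning rate, epoch count, and the names `train_feature` and `rmse` are hypothetical choices, not Funk's settings.

```python
def train_feature(ratings, n_users, n_movies, lrate=0.01, epochs=500, init=0.1):
    """Fit one feature so that rating ~ user_value[u] * movie_value[m]."""
    user_value = [init] * n_users
    movie_value = [init] * n_movies
    for _ in range(epochs):
        for user, movie, rating in ratings:
            err = lrate * (rating - user_value[user] * movie_value[movie])
            uv = user_value[user]                  # keep the pre-update value
            user_value[user] += err * movie_value[movie]
            movie_value[movie] += err * uv
    return user_value, movie_value

def rmse(ratings, user_value, movie_value):
    sq = sum((r - user_value[u] * movie_value[m]) ** 2 for u, m, r in ratings)
    return (sq / len(ratings)) ** 0.5

# Hypothetical toy data: (user, movie, rating) triples; user 2 never rated movie 0.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0), (2, 1, 1.0)]

before = rmse(ratings, [0.1] * 3, [0.1] * 2)
user_value, movie_value = train_feature(ratings, n_users=3, n_movies=2)
after = rmse(ratings, user_value, movie_value)
print(before, after)
```

Only the observed ratings are visited, so the missing entries of the sparse matrix never need to be stored or filled in.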

Page 28: SVD and the Netflix Dataset

Netflix Results

Best RMSEs

0.9283

0.9212

Blended to get 0.9189, 3.42% better than Netflix.

Page 29: SVD and the Netflix Dataset

Summary

SVD provides an elegant and automatic recommendation system that has the potential to scale.

There are many different algorithms to calculate, or at least approximate, the SVD, which can be used in offline stages for websites that need CF.

Every dataset is different and requires experimentation to get the best results.