Google News Personalization

Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai


Transcript of Google News Personalization

Page 1: Google News Personalization

Google News Personalization

Big Data reading group, November 12, 2007

Presented by Babu Pillai

Page 2: Google News Personalization

Problem: finding stuff on Internet

• Know what you want:
– content-based filtering
– search

• Don’t know:
– browse

• How to handle: “Don’t know, but show me something interesting!”

Page 3: Google News Personalization

Google News

• Top Stories

• Recommendations for registered users

• Based on user click history, community clicks

Page 4: Google News Personalization

Problem Scale

• Lots of users (more is good)
– Millions of clicks from millions of users

• Problem: high churn in item set
– Several million items (clusters of news articles about the same story, as identified by GN) per month
– Continuous addition, deletion

• Strict timing (few hundred ms)

• Existing systems not suitable

Page 5: Google News Personalization

Memory-based Ratings

• General form:

where r is the rating of item s_k for user u_a, and w(u_a, u_i) is the similarity between users u_a and u_i

• Problem: scalability, even when similarity is computed offline
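
The formula on this slide did not survive extraction. A hedged reconstruction, consistent with the symbols defined above and with the usual memory-based form, is given here. The indicator I(u_i, s_k), equal to 1 if user u_i clicked item s_k and 0 otherwise, is an assumption about what the slide showed.

```latex
r_{u_a, s_k} = \sum_{i \neq a} I(u_i, s_k)\, w(u_a, u_i)
```

The sum runs over every other user, which is where the scalability problem comes from.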

Page 6: Google News Personalization

Model-based techniques

• Clustering / segmentation, e.g. based on interests

• Bayesian models, Markov decision processes, …
– All are computationally expensive

Page 7: Google News Personalization

What’s in this paper?

• Investigate 2 different ways to cluster users: MinHash, and PLSI

• Implement both on MapReduce

Page 8: Google News Personalization

Google News Rating Model

• 1 click = 1 positive vote

• Noisier than a 1-5 rating (Netflix)

• No explicit negatives

• Why might it work? Partly because a fairly substantial article snippet is shown, so a user who clicks is likely genuinely interested

Page 9: Google News Personalization

Design guidelines for a scalable rating system

• Group users into clusters of similar users (based on prior clicks; computed offline)

• Users can belong to multiple clusters

• Generate rating using much smaller sets of user clusters, rather than all users:
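
The formula for this slide is missing from the transcript. One plausible reconstruction, reusing the symbols from the memory-based form, is sketched below. The membership weight w(u_a, c_i) and the click indicator I(u_j, s_k) are assumptions, not taken from the slide.

```latex
r_{u_a, s_k} \propto \sum_{c_i \,:\, u_a \in c_i} w(u_a, c_i) \sum_{u_j \in c_i} I(u_j, s_k)
```

The outer sum covers only the clusters that u_a belongs to, so the cost depends on the number of clusters per user rather than on the total number of users.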

Page 10: Google News Personalization

Technique 1: MinHash

• Probabilistically assign users to clusters based on click history

• Use Jaccard coefficient:

– the corresponding distance (1 minus the coefficient) is a metric

• Computing this similarity between a user and all other users is computationally expensive; not feasible even offline
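
As a small illustration of the Jaccard coefficient on click histories, here is a minimal sketch. The function name `jaccard` and the sample users are hypothetical, chosen only for this example.

```python
def jaccard(clicks_a: set, clicks_b: set) -> float:
    """Jaccard coefficient: size of intersection over size of union."""
    union = clicks_a | clicks_b
    if not union:
        return 0.0  # convention chosen here for two empty histories
    return len(clicks_a & clicks_b) / len(union)


alice = {"s1", "s2", "s3"}
bob = {"s2", "s3", "s4"}

similarity = jaccard(alice, bob)  # 2 shared items / 4 distinct items = 0.5
distance = 1.0 - similarity       # Jaccard distance
print(similarity, distance)
```

With n users there are n(n-1)/2 pairs to compare, which is why the slide treats the direct computation as infeasible.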

Page 11: Google News Personalization

MinHash as a form of Locality Sensitive Hashing

• Basic idea: assign hash value to each user based on click history

• How: randomly permute set of all items; assign id of first item in this order that appears in the user’s click history as the hash value for the user

• Probability that 2 users have the same hash is equal to the Jaccard coefficient
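
The collision property can be checked empirically. The sketch below permutes the item set many times and counts how often two users get the same MinHash value. The helper names (`minhash`, `estimate_collision_rate`) and the toy data are assumptions made for illustration.

```python
import random


def minhash(click_history: set, permutation: list) -> str:
    """Return the first item in the permuted order that the user clicked."""
    for item in permutation:
        if item in click_history:
            return item
    raise ValueError("click history shares no item with the permutation")


def estimate_collision_rate(a: set, b: set, items: list,
                            trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random permutations under which a and b hash identically."""
    rng = random.Random(seed)
    order = list(items)
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # a fresh random permutation of all items
        if minhash(a, order) == minhash(b, order):
            hits += 1
    return hits / trials


items = [f"s{i}" for i in range(10)]
alice = {"s1", "s2", "s3"}
bob = {"s2", "s3", "s4"}

# The true Jaccard coefficient here is 2 / 4 = 0.5; the estimate should be close.
rate = estimate_collision_rate(alice, bob, items)
print(rate)
```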

Page 12: Google News Personalization

Using MinHash for clusters

• Concatenate p>1 such hashes as cluster id for increased precision

• Apply q>1 in parallel (users belong to q clusters) to improve recall

• Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
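
A minimal sketch of the seeded-hash trick follows. Instead of storing permutations, each item id is hashed together with a seed, and the item with the smallest hash value plays the role of "first item in the permutation". The use of SHA-256, the `group:part|part` id format, and all function names are assumptions for illustration, not details from the paper.

```python
import hashlib


def seeded_hash(item_id: str, seed: int) -> int:
    """Deterministic hash of (seed, item); stands in for a permutation index."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(click_history: set, seed: int) -> str:
    """Item whose seeded hash is smallest, i.e. first in the implied order."""
    return min(click_history, key=lambda item: seeded_hash(item, seed))


def cluster_ids(click_history: set, p: int = 2, q: int = 3) -> list:
    """Build q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for group in range(q):
        seeds = range(group * p, (group + 1) * p)  # p distinct seeds per group
        parts = [minhash(click_history, seed) for seed in seeds]
        # Prefix with the group index so ids from different groups never collide.
        ids.append(f"{group}:" + "|".join(parts))
    return ids


ids = cluster_ids({"s1", "s2", "s3"})
print(ids)
```

Larger p makes two users less likely to share a given cluster id unless their histories are very similar, while larger q gives them more chances to share at least one.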

Page 13: Google News Personalization

MinHash on MapReduce

• Generate p × q hashes for each user based on click history; generate q cluster ids, each the concatenation of p hashes

• Map using cluster ids as keys

• Reduce to form membership lists for each cluster id
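
The map and reduce steps can be simulated in a single process. This is only a sketch: `map_user`, `shuffle`, and `reduce_cluster` are hypothetical names, the seeded-hash cluster ids are one possible realization of the previous slide, and a real MapReduce framework would perform the shuffle itself.

```python
import hashlib
from collections import defaultdict


def cluster_ids(click_history: set, p: int = 2, q: int = 2) -> list:
    """q cluster ids, each built from p seeded MinHash values (assumed scheme)."""
    def h(item, seed):
        return hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
    ids = []
    for group in range(q):
        parts = [min(click_history, key=lambda s: h(s, group * p + k))
                 for k in range(p)]
        ids.append(f"{group}:" + "|".join(parts))
    return ids


def map_user(user_id: str, click_history: set):
    """Map: emit one (cluster id, user id) pair per cluster the user falls in."""
    for cid in cluster_ids(click_history):
        yield cid, user_id


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_cluster(cluster_id: str, user_ids: list) -> list:
    """Reduce: the membership list for one cluster id."""
    return sorted(user_ids)


histories = {"u1": {"s1", "s2"}, "u2": {"s1", "s2"}, "u3": {"s9"}}
emitted = [pair for u, clicks in histories.items() for pair in map_user(u, clicks)]
memberships = {cid: reduce_cluster(cid, users)
               for cid, users in shuffle(emitted).items()}
print(memberships)
```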

Page 14: Google News Personalization

Technique 2: PLSI clustering

• Probabilistic Latent Semantic Indexing

• Main idea: hidden state z that correlates users and items

• Generate this clustering from training set based on EM algorithm given by Hofmann04
– Iterative technique, generates new probability estimates based on previous estimates
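
To make the iteration concrete, here is a small pure-Python sketch of EM for a PLSI-style model in which p(s|u) is the sum over z of p(s|z) p(z|u). It is an illustrative toy under stated assumptions (random initialization, fixed iteration count, hypothetical names), not the paper's implementation.

```python
import random


def plsi_em(clicks, num_z=2, iterations=20, seed=0):
    """clicks: list of (user, item). Returns (p(z|u), p(s|z)) as nested dicts."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in clicks})
    items = sorted({s for _, s in clicks})
    zs = list(range(num_z))

    def random_dist(keys):
        weights = [rng.random() + 0.1 for _ in keys]  # strictly positive
        total = sum(weights)
        return {k: w / total for k, w in zip(keys, weights)}

    p_z_given_u = {u: random_dist(zs) for u in users}
    p_s_given_z = {z: random_dist(items) for z in zs}

    for _ in range(iterations):
        n_zs = {z: {s: 0.0 for s in items} for z in zs}  # expected counts N(z,s)
        n_uz = {u: {z: 0.0 for z in zs} for u in users}  # expected counts per user
        # E-step: posterior over z for each observed (u, s) click.
        for u, s in clicks:
            joint = [p_s_given_z[z][s] * p_z_given_u[u][z] for z in zs]
            norm = sum(joint)
            for z in zs:
                q = joint[z] / norm
                n_zs[z][s] += q
                n_uz[u][z] += q
        # M-step: renormalize the expected counts into new estimates.
        for z in zs:
            n_z = sum(n_zs[z].values())
            p_s_given_z[z] = {s: n_zs[z][s] / n_z for s in items}
        for u in users:
            n_u = sum(n_uz[u].values())
            p_z_given_u[u] = {z: n_uz[u][z] / n_u for z in zs}
    return p_z_given_u, p_s_given_z


clicks = [("u1", "sports1"), ("u1", "sports2"), ("u2", "sports1"),
          ("u2", "sports2"), ("u3", "tech1"), ("u3", "tech2"),
          ("u4", "tech1"), ("u4", "tech2")]
p_z_given_u, p_s_given_z = plsi_em(clicks)
print(p_z_given_u)
```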

Page 15: Google News Personalization

PLSI as MapReduce

• Q* can be computed independently for each (u,s), given the previous iteration's N(z,s), N(z), and p(z|u): map to R × K machines (R and K partitions for u and s, respectively)

• Reduce is simply addition
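
A rough single-process sketch of this partitioning follows. Each click is assigned to one of R × K shards, every shard computes its posteriors from the shared statistics, and the reduce step only sums. The key layout, the CRC-based shard function, and the uniform starting statistics are all assumptions for illustration.

```python
import zlib
from collections import defaultdict


def shard_of(user: str, item: str, R: int, K: int) -> tuple:
    """Assign a click to one of R x K shards by hashing user and item ids."""
    return (zlib.crc32(user.encode()) % R, zlib.crc32(item.encode()) % K)


def map_click(user, item, n_zs, n_z, p_z_given_u):
    """Compute the posterior over z for one click and emit partial sums."""
    zs = sorted(n_z)
    joint = [n_zs[(z, item)] / n_z[z] * p_z_given_u[(user, z)] for z in zs]
    norm = sum(joint)
    for z, value in zip(zs, joint):
        q = value / norm
        yield ("N(z,s)", z, item), q
        yield ("N(z)", z), q
        yield ("N(u,z)", user, z), q


def reduce_sum(pairs):
    """Reduce: add up all values emitted under the same key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


zs = [0, 1]
clicks = [("u1", "s1"), ("u1", "s2"), ("u2", "s1"), ("u3", "s3")]
users = sorted({u for u, _ in clicks})
items = sorted({s for _, s in clicks})

# Statistics from the previous iteration (uniform here, purely for the demo).
n_zs = {(z, s): 1.0 for z in zs for s in items}
n_z = {z: float(len(items)) for z in zs}
p_z_given_u = {(u, z): 1.0 / len(zs) for u in users for z in zs}

R, K = 2, 2
shards = defaultdict(list)
for u, s in clicks:
    shards[shard_of(u, s, R, K)].append((u, s))

emitted = []
for shard_clicks in shards.values():  # each shard could run on its own machine
    for u, s in shard_clicks:
        emitted.extend(map_click(u, s, n_zs, n_z, p_z_given_u))

totals = reduce_sum(emitted)
print(totals)
```

Because each click's posterior sums to 1, the N(z) totals add up to the number of clicks regardless of how the clicks were sharded.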

Page 16: Google News Personalization

PLSI in a dynamic environment

• Treat Z as user clusters

• On each click, update p(s|z) for all clusters the user belongs to

• This approximates PLSI, but is updated dynamically as additional items are added

• Does not allow additions of users

Page 17: Google News Personalization

Cluster-based recommendation

• For each cluster, maintain number of clicks, decayed by time, for each item visited by a member

• For a candidate item, look up the user’s clusters and add up the age-discounted visitation counts, normalized by total clicks

• Do this using both MinHash and PLSI clustering
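
The scoring step can be sketched as below. The slide only says counts are "decayed by time", so the exponential half-life decay, the per-cluster normalization, and the `ClusterStats` class are assumptions. A production system would keep running decayed counters rather than raw timestamps.

```python
from collections import defaultdict


class ClusterStats:
    """Per-cluster click statistics with time decay (illustrative only)."""

    def __init__(self, half_life_s: float = 3600.0):
        self.half_life_s = half_life_s
        self.item_clicks = defaultdict(list)     # (cluster, item) -> timestamps
        self.cluster_clicks = defaultdict(list)  # cluster -> timestamps

    def record_click(self, user_clusters, item, t):
        """Credit a click to every cluster the clicking user belongs to."""
        for c in user_clusters:
            self.item_clicks[(c, item)].append(t)
            self.cluster_clicks[c].append(t)

    def _decayed(self, timestamps, now):
        """Sum of clicks, each halved for every half-life of age."""
        return sum(0.5 ** ((now - t) / self.half_life_s) for t in timestamps)

    def score(self, user_clusters, item, now):
        """Sum over the user's clusters of decayed item clicks / decayed total."""
        total = 0.0
        for c in user_clusters:
            denom = self._decayed(self.cluster_clicks.get(c, []), now)
            if denom > 0.0:
                total += self._decayed(self.item_clicks.get((c, item), []), now) / denom
        return total


stats = ClusterStats(half_life_s=3600.0)
stats.record_click(["c1"], "old_story", t=0.0)
stats.record_click(["c1"], "new_story", t=7200.0)

now = 7200.0
old_score = stats.score(["c1"], "old_story", now)  # decayed 0.25 of 1.25 total
new_score = stats.score(["c1"], "new_story", now)  # decayed 1.0 of 1.25 total
print(old_score, new_score)
```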

Page 18: Google News Personalization

One more technique: Covisitation

• Memory-based technique

• Create adjacency matrix between all pairs of items (can be directed)

• Increment corresponding count if one item visited soon after another

• Recommendation: for candidate item j, sum of all counts from i to j for all items i in recent click history of user, normalized appropriately
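
A minimal sketch of directed covisitation counting and scoring follows. The time window, the normalization by each source item's outgoing total, and the `Covisitation` class are assumptions, since the slide leaves "soon after" and "normalized appropriately" unspecified.

```python
from collections import defaultdict


class Covisitation:
    """Directed item-to-item covisitation counts (illustrative only)."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.counts = defaultdict(float)      # (i, j) -> times j followed i
        self.out_totals = defaultdict(float)  # i -> total outgoing count
        self.history = defaultdict(list)      # user -> [(timestamp, item)]

    def record_click(self, user, item, t):
        """Increment (previous item, item) for each recent earlier click."""
        for prev_t, prev_item in self.history[user]:
            if prev_item != item and 0 <= t - prev_t <= self.window_s:
                self.counts[(prev_item, item)] += 1.0
                self.out_totals[prev_item] += 1.0
        self.history[user].append((t, item))

    def score(self, user, candidate, recent: int = 5):
        """Sum of normalized counts from the user's recent items to candidate."""
        total = 0.0
        for _, i in self.history.get(user, [])[-recent:]:
            denom = self.out_totals.get(i, 0.0)
            if denom:
                total += self.counts.get((i, candidate), 0.0) / denom
        return total


cv = Covisitation()
for user, second in [("u1", "b"), ("u2", "b"), ("u3", "c")]:
    cv.record_click(user, "a", t=0.0)
    cv.record_click(user, second, t=10.0)

cv.record_click("u4", "a", t=20.0)
score_b = cv.score("u4", "b")  # 2 of the 3 transitions out of "a" went to "b"
score_c = cv.score("u4", "c")  # 1 of the 3 transitions out of "a" went to "c"
print(score_b, score_c)
```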

Page 19: Google News Personalization

Whole System

• Offline clustering

• Online click history update, cluster item stats update, covisitation update

Page 20: Google News Personalization

Results

Generally around 30-50% better than popularity-based recommendations

Page 21: Google News Personalization

Techniques don’t work well together, though

Page 22: Google News Personalization

Discussion

• Covisitation appears to work as well as clustering

• Operational details missing: how big are cluster memberships, etc.

• All of the clustering is done offline