A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis.

A more efficient Collaborative Filtering method

Tam Ming Wai

Dr. Nikos Mamoulis

Outline

Introduction to Collaborative Filtering
Special nature of CF
Inverted File Search Algorithm
Item-based
Slope-one
Hybrid method
No random access
Experiment

Collaborative Filtering

Looking for opinions from friends with similar taste

The active user collaborates with other users
Trust those with similar taste more

Example

     i1  i2  i3  i4  i5  ia
ua    1   2   3   4   5   ?
u1    1   2   3   4   5   5
u2    5   4   3   2   1   1

ua trusts u1 more than u2

Special nature of CF

Trust your intuition in the following few slides

Searching for similar users

Which user is the best one to trust in order to predict “?”?

Everyone. Only i2 is relevant.

     i1  i2  i3  i4  ia
ua    -   2   -   -   ?
u1    -   2   -   -   3
u2    1   2   -   -   1
u3    -   2   2   -   4
u4    2   2   -   3   2
u5    1   2   2   1   4

Similarity

The similarity is not based on all attributes (the items)

Only the items which the active user rated are relevant

Although some (Breese et al.) suggested that more items could be considered (by default voting), this is not popular.

Searching for similar users

Which user is the best one to trust in order to predict “?”?

Everyone except u5

     i1  i2  i3  i4  ia
ua    1   2   3   5   ?
u1    1   -   -   -   3
u2    -   2   -   -   1
u3    -   -   3   -   4
u4    -   -   -   5   2
u5    -   -   -   -   4

Similarity

The similarity is not based on all attributes (the items)

Only the items which both the active user and the user under consideration rated are relevant

A Notice

ua is similar to u1, u2, u3 and u4

BUT

u1, u2, u3 and u4 are completely unrelated to each other

Searching for similar users

Which user is the best one to trust in order to predict “?”?

u3 is the one.

Only u3 is relevant, because only u3 rated ia

     i1  i2  i3  i4  ia
ua    1   2   3   4   ?
u1    1   2   3   5   -
u2    2   3   1   4   -
u3    4   3   2   1   4
u4    2   1   1   3   -
u5    1   4   2   1   -

Top-k most similar users

It is not the top-k among all users
It is the top-k among the users who rated ia

Summary on the nature

The matrix is incomplete

Similarity
The set of items could be different for every pair of users (the intersection)
The set of users (the candidates) could be different for each query (those who rated ia)
No triangle inequality (in the extreme, ua is similar to u1 and u2, but u1 and u2 can be unrelated)

Popular Similarity measure

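Very often, the Pearson correlation is used:

$$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}}$$

j iterates through the items that were rated by both user i and user a

$v_{a,j}$: vote (rating) on item j by user a
$\bar{v}_a$: average vote (rating) of user a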
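A minimal sketch of the Pearson weight described above, assuming each user profile is a dict from item id to rating and that each user's average is taken over that user's whole profile. The function name `pearson` and the zero return for an empty or constant overlap are illustrative choices, not part of the slides.

```python
from math import sqrt

def pearson(profile_a, profile_i):
    """Pearson weight w(a, i), summed over the items both users rated."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0  # assumption: no co-rated item means no evidence
    # Averages are taken over each user's full profile.
    mean_a = sum(profile_a.values()) / len(profile_a)
    mean_i = sum(profile_i.values()) / len(profile_i)
    s_ai = sum((profile_a[j] - mean_a) * (profile_i[j] - mean_i) for j in common)
    s_aa = sum((profile_a[j] - mean_a) ** 2 for j in common)
    s_ii = sum((profile_i[j] - mean_i) ** 2 for j in common)
    if s_aa == 0 or s_ii == 0:
        return 0.0  # assumption: a constant overlap carries no signal
    return s_ai / sqrt(s_aa * s_ii)

# Profiles from the first example slide.
ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 5, "i2": 4, "i3": 3, "i4": 2, "i5": 1, "ia": 1}
w1, w2 = pearson(ua, u1), pearson(ua, u2)
print(w1, w2)  # u1 correlates positively with ua, u2 negatively
```

With these profiles the weight for u1 is strongly positive and the weight for u2 strongly negative, which matches the intuition that ua trusts u1 more than u2.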

Output - Prediction

C is the set of users who rated the queried item
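The prediction formula itself did not survive in the transcript. A form consistent with the Pearson weight w(a,i) and with Breese et al. might look like the following, where the normalizer κ (chosen so the absolute weights sum to one) is an assumption about what the slide showed:

```latex
p_{a,j} = \bar{v}_a + \kappa \sum_{i \in C} w(a,i)\,\bigl(v_{i,j} - \bar{v}_i\bigr),
\qquad
\kappa = \frac{1}{\sum_{i \in C} \lvert w(a,i) \rvert}
```

Here $p_{a,j}$ is the predicted rating of the active user a on the queried item j, and C is the set of advisors who rated j.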

Brute Force Searching

Given an active user and an active movie:
Relevant movies are known from the active user profile
Candidates are known from the active movie profile
Find sim(ua, ui) for all ui in the candidate set
The top-k are used as advisors
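The brute-force steps above can be sketched as follows. The helper names (`candidates_for`, `top_k_advisors`) and the placeholder similarity `mean_abs_sim` are hypothetical; any similarity function, such as the Pearson correlation, could be passed in instead. The ratings are the ones used in the "Useful Information" slides.

```python
import heapq

def candidates_for(ratings, active_user, active_item):
    """Users other than the active one whose profile contains the active item."""
    return [u for u, profile in ratings.items()
            if u != active_user and active_item in profile]

def mean_abs_sim(profile_a, profile_i, max_diff=4):
    """Placeholder similarity: 1 - (mean |difference| over co-rated items) / max_diff."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0
    dist = sum(abs(profile_a[j] - profile_i[j]) for j in common) / len(common)
    return 1.0 - dist / max_diff

def top_k_advisors(ratings, active_user, active_item, k, sim=mean_abs_sim):
    """Score every candidate against the active user and keep the k best."""
    profile_a = ratings[active_user]
    scored = [(u, sim(profile_a, ratings[u]))
              for u in candidates_for(ratings, active_user, active_item)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

ratings = {
    "ua": {"i1": 1, "i2": 2, "i4": 4},
    "u1": {"i2": 2, "i3": 3, "ia": 4},
    "u2": {"i1": 2, "i2": 3, "i3": 1},
    "u3": {"i1": 4, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 1, "i4": 3, "ia": 3},
    "u5": {"i1": 1, "i3": 2, "i4": 1},
}
candidates = candidates_for(ratings, "ua", "ia")
advisors = top_k_advisors(ratings, "ua", "ia", k=2)
print(candidates, advisors)
```

Every candidate profile is read in full, which is why the cost grows with the number of users who rated the active movie.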

Useful Information

What is the useful information?

     i1  i2  i3  i4  ia
ua    1   2   -   4   ?
u1    -   2   3   -   4
u2    2   3   1   -   -
u3    4   -   2   1   4
u4    2   1   -   3   3
u5    1   -   2   1   -

Useful Information

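What is the useful information?

The entries highlighted in green on the original slide are useful: the ratings that the candidates (the users who rated ia) gave to the items ua rated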

Useful Information

All user profiles, or all movie profiles, contain the useful information

Inverted file

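User-item matrix (rows: users, columns: items, "-": not rated)

        Item
User    1  2  3  4  5  6
1       -  1  -  3  4  5
2       1  3  4  5  -  5
3       -  3  -  4  1  -

Inverted file (one list of (user, rating) entries per item)

Item 1: (2, 1)
Item 2: (1, 1) (2, 3) (3, 3)
Item 3: (2, 4)
Item 4: (1, 3) (2, 5) (3, 4)
Item 5: (1, 4) (3, 1)
Item 6: (1, 5) (2, 5)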

Coster & Svensson 2002
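As a rough illustration of the structure, the inverted file for the three-user matrix on this slide could be built like this. The function name `build_inverted_file` and the dict-of-lists layout are assumptions made for the sketch.

```python
from collections import defaultdict

def build_inverted_file(user_profiles):
    """Map each item to the list of (user, rating) entries of users who rated it."""
    inverted = defaultdict(list)
    for user in sorted(user_profiles):
        for item, rating in user_profiles[user].items():
            inverted[item].append((user, rating))
    return dict(inverted)

# The user-item matrix from the slide; missing ratings are simply absent.
user_profiles = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = build_inverted_file(user_profiles)
for item in sorted(inverted):
    print(item, inverted[item])
```

Each list holds only the ratings that exist, so a query touches just the lists of the items the active user rated.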

Pearson Correlation

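The active user is fixed in a single query
For each user i, there are 3 summations
Instead of calculating w(a,i) for each user i separately, calculate SAI[i], SAA[i] and SII[i] for all users at once (with the help of the inverted lists)

$$SAI[i] = \sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)$$

$$SAA[i] = \sum_j (v_{a,j} - \bar{v}_a)^2$$

$$SII[i] = \sum_j (v_{i,j} - \bar{v}_i)^2$$

$$w(a,i) = \frac{SAI[i]}{\sqrt{SAA[i] \cdot SII[i]}}$$

where j runs over the items rated by both user a and user i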
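One way the three accumulators could be filled in a single pass over the inverted lists is sketched below. The function name `pearson_weights`, the precomputed `user_means` table, and the active profile in the demo are assumptions; the slides only name the arrays SAI, SAA and SII.

```python
from collections import defaultdict
from math import sqrt

def pearson_weights(active_profile, inverted, user_means):
    """Accumulate SAI, SAA, SII per user while scanning the active user's item lists."""
    mean_a = sum(active_profile.values()) / len(active_profile)
    sai = defaultdict(float)
    saa = defaultdict(float)  # indexed by user: it only covers the co-rated items
    sii = defaultdict(float)
    for item, rating_a in active_profile.items():
        dev_a = rating_a - mean_a
        for user, rating_i in inverted.get(item, []):
            dev_i = rating_i - user_means[user]
            sai[user] += dev_a * dev_i
            saa[user] += dev_a * dev_a
            sii[user] += dev_i * dev_i
    weights = {}
    for user in sai:
        denom = sqrt(saa[user] * sii[user])
        weights[user] = sai[user] / denom if denom > 0 else 0.0
    return weights

user_profiles = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = {}
for user, profile in user_profiles.items():
    for item, rating in profile.items():
        inverted.setdefault(item, []).append((user, rating))
user_means = {u: sum(p.values()) / len(p) for u, p in user_profiles.items()}

# Hypothetical active user whose ratings happen to match user 1.
active = {2: 1, 4: 3, 5: 4, 6: 5}
weights = pearson_weights(active, inverted, user_means)
print(weights)
```

Only users that appear in at least one of the scanned lists ever get an accumulator, so users with no co-rated item cost nothing.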

Early Termination

Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994

Quit: stop when the number of users reaches a threshold

Continue: stop considering new users when the number of users reaches a threshold
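The two strategies could be sketched like this, under the assumption that "number of users" means the number of distinct users that already have an accumulator. The function `accumulate`, which only counts co-rated items, is a simplified stand-in for the real similarity accumulation.

```python
def accumulate(active_items, inverted, max_users, strategy="continue"):
    """Count co-rated items per user, limiting the number of tracked users.

    'quit':     stop scanning as soon as max_users distinct users are tracked.
    'continue': keep scanning, but only update users that are already tracked.
    """
    counts = {}
    for item in active_items:
        for user, _rating in inverted.get(item, []):
            if user in counts:
                counts[user] += 1
            elif len(counts) < max_users:
                counts[user] = 1
                if strategy == "quit" and len(counts) == max_users:
                    return counts
            # otherwise: a new user beyond the limit is ignored
    return counts

# Inverted file from the earlier slide.
inverted = {
    1: [(2, 1)],
    2: [(1, 1), (2, 3), (3, 3)],
    3: [(2, 4)],
    4: [(1, 3), (2, 5), (3, 4)],
    5: [(1, 4), (3, 1)],
    6: [(1, 5), (2, 5)],
}
quit_counts = accumulate([1, 2, 4], inverted, max_users=2, strategy="quit")
continue_counts = accumulate([1, 2, 4], inverted, max_users=2, strategy="continue")
print(quit_counts, continue_counts)
```

Quit is cheaper but leaves the tracked users with partial sums; continue reads every list and finishes the sums for the users it kept.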

Item-based

The roles of rows and columns in the matrix are symmetric
Exchange the role of row (user profile) and column (movie profile)
Look for movies which are similar to the active movie
If the users act similarly on both movies, the active user may act similarly too

Item-based example

The users act exactly the same on i2 and ia

Perhaps i2 and ia are very similar

“?” may be 1, as ua gave i2 a rating of 1

     i1  i2  i3  i4  ia
ua    1   1   3   4   ?
u1    1   1   3   5   1
u2    2   2   1   4   2
u3    4   4   2   1   4
u4    2   4   1   3   4
u5    1   5   2   1   5

Sarwar et al. 2001: pre-find the top-k similar items

Amazon.com: personal promotion on the top-k similar items
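A small sketch of the item-based idea on the example above. It uses a Pearson correlation between item columns and a similarity-weighted average of the active user's own ratings; this is one simple choice for illustration, not necessarily the measure used by Sarwar et al. or by Amazon.com. The function names are hypothetical.

```python
from math import sqrt

def item_similarity(ratings, item_x, item_y):
    """Pearson correlation between two item columns over users who rated both."""
    pairs = [(p[item_x], p[item_y]) for p in ratings.values()
             if item_x in p and item_y in p]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    den = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs)
               * sum((y - mean_y) ** 2 for _, y in pairs))
    return num / den if den > 0 else 0.0

def predict_item_based(ratings, active_profile, active_item, k):
    """Weighted average of the active user's ratings on the k most similar items."""
    sims = sorted(((item_similarity(ratings, active_item, j), j)
                   for j in active_profile), reverse=True)[:k]
    positive = [(s, j) for s, j in sims if s > 0]
    if not positive:
        return None
    return (sum(s * active_profile[j] for s, j in positive)
            / sum(s for s, _ in positive))

ratings = {
    "u1": {"i1": 1, "i2": 1, "i3": 3, "i4": 5, "ia": 1},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 1, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 2, "i4": 1, "ia": 5},
}
ua = {"i1": 1, "i2": 1, "i3": 3, "i4": 4}
sim_i2 = item_similarity(ratings, "ia", "i2")
prediction = predict_item_based(ratings, ua, "ia", k=1)
print(sim_i2, prediction)
```

With k=1 the only neighbour is i2, whose column is identical to ia, so the prediction equals ua's rating on i2.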

Slope-one

Not only find similar items
Measure the pattern between items
Lemire & Maclachlan 2005

Slope-one

For an item pair j and i
For all users who rated both items
Find the average difference in rating

Slope-one

A prediction is made based on dev_{j,i}
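The formulas on these slides were not preserved. In the notation of Lemire & Maclachlan, the average deviation and the basic Slope One prediction can be written as below; the symbols $S_{j,i}$ (users who rated both items j and i) and $R_j$ (items i ≠ j rated by user u for which $S_{j,i}$ is not empty) are introduced here for the sketch.

```latex
\mathrm{dev}_{j,i} = \frac{1}{\lvert S_{j,i} \rvert} \sum_{u \in S_{j,i}} (u_j - u_i),
\qquad
P(u)_j = \frac{1}{\lvert R_j \rvert} \sum_{i \in R_j} \bigl(\mathrm{dev}_{j,i} + u_i\bigr)
```

Here $u_j$ is the rating user u gave to item j, and $P(u)_j$ is the predicted rating of u on item j.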

Slope-one example

All users gave ia a rating higher than i3 by 1

Considering ia and i3, ua may rate “?” as 4

     i1  i2  i3  i4  ia
ua    1   4   3   4   ?
u1    1   2   1   5   2
u2    2   2   1   4   2
u3    4   4   3   1   4
u4    2   4   3   3   4
u5    1   5   4   1   5
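The example above can be checked with a short sketch of the basic (unweighted) Slope One scheme. The function names `deviation` and `slope_one_predict` are illustrative, and averaging the per-item estimates with equal weight is the simplest variant.

```python
def deviation(ratings, item_j, item_i):
    """Average of (rating on item_j - rating on item_i) over users who rated both."""
    diffs = [p[item_j] - p[item_i] for p in ratings.values()
             if item_j in p and item_i in p]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(ratings, active_profile, target):
    """Average of (dev(target, i) + active rating on i) over the active user's items."""
    estimates = []
    for item_i, rating in active_profile.items():
        dev = deviation(ratings, target, item_i)
        if dev is not None:
            estimates.append(dev + rating)
    return sum(estimates) / len(estimates) if estimates else None

ratings = {
    "u1": {"i1": 1, "i2": 2, "i3": 1, "i4": 5, "ia": 2},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 3, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 3, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 4, "i4": 1, "ia": 5},
}
ua = {"i1": 1, "i2": 4, "i3": 3, "i4": 4}

dev_ia_i3 = deviation(ratings, "ia", "i3")
estimate_from_i3 = dev_ia_i3 + ua["i3"]
prediction = slope_one_predict(ratings, ua, "ia")
print(dev_ia_i3, estimate_from_i3, prediction)
```

Using i3 alone gives the slide's estimate of 4; averaging over all four items ua rated gives a slightly lower value because i1 pulls the estimate down.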

Summary

A common argument: there are fewer items than users

Pre-computation: similarity in item-based, dev_{j,i} in slope-one

Hybrid method

Finding top-k similar users
Brute force: inefficient when the number of candidates is large
Inverted file: inefficient when the number of relevant items is large

Mixing the two

Hybrid method

Inverted file again
The files are segmented according to ratings

Segmented inverted file example

Inverted list of I1, one segment per rating:

All users here gave I1 rating 5

All users here gave I1 rating 4

All users here gave I1 rating 3

All users here gave I1 rating 2

All users here gave I1 rating 1

Accessing Segmented inverted file

First access the segments which are closer to the active user’s rating

Access example

Segments of the inverted list of I1, in access order:

Access order 1, d=0 (ua’s rating is here)

Access order 2, d=1

Access order 3, d=1

Access order 4, d=2

Access order 5, d=3

Accessing Segmented inverted file

The inverted file is a list ranked on d (distance to ua’s rating)

The best bound on similarity can be found
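A possible sketch of a segmented list and its access order. The posting list, the user names, and the active rating of 4 are made up for illustration (a rating of 4 or 2 on a 1 to 5 scale would both give the distances 0, 1, 1, 2, 3 shown in the access example); ties between equally distant segments are broken by rating, which is an arbitrary choice.

```python
from collections import defaultdict

def segment_by_rating(postings):
    """Group the (user, rating) entries of one inverted list by rating."""
    segments = defaultdict(list)
    for user, rating in postings:
        segments[rating].append(user)
    return dict(segments)

def access_order(segments, active_rating):
    """Return (d, rating, users) triples, segments closest to active_rating first."""
    ordered = sorted(segments, key=lambda r: (abs(r - active_rating), r))
    return [(abs(r - active_rating), r, segments[r]) for r in ordered]

# Hypothetical inverted list for item I1.
postings_i1 = [("u1", 5), ("u2", 4), ("u3", 3), ("u4", 2), ("u5", 1), ("u6", 4)]
segments = segment_by_rating(postings_i1)
order = access_order(segments, active_rating=4)
for d, rating, users in order:
    print(f"d={d} rating={rating} users={users}")
```

Because the segments come out in non-decreasing d, any user not yet seen in this list must differ from ua by at least the d of the next unread segment, which is what makes a bound on the similarity possible.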

Algorithm

Phase 1
Access all inverted lists such that all d=0 segments are loaded
Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
The similarity of the k-th candidate whose actual similarity is known becomes the initial filter

Algorithm phase 1 example

candidate   actual similarity
u3          0.89
u8          0.88
…           …
u1          0.77
u9          0.70   (k-th candidate, used as the filter)
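Phase 1 might be sketched as follows. The d=0 segments, the similarity table (including the value for u5), and k=4 are invented for the demo; `actual_similarity` stands for whatever random-access similarity computation the real system performs.

```python
from collections import Counter

def phase_one_filter(d0_segments, actual_similarity, k):
    """Evaluate the k most frequently seen candidates; the lowest of their
    actual similarities becomes the initial filter."""
    seen = Counter(user for segment in d0_segments for user in segment)
    evaluated = {}
    for user, _count in seen.most_common():
        evaluated[user] = actual_similarity(user)
        if len(evaluated) == k:
            break
    if len(evaluated) < k:
        return evaluated, None  # fewer than k candidates: no filter yet
    return evaluated, min(evaluated.values())

# Hypothetical d=0 segments, one per item rated by the active user.
d0_segments = [
    ["u3", "u8", "u1"],
    ["u3", "u8", "u9"],
    ["u3", "u1", "u9", "u8"],
    ["u5"],
]
known = {"u3": 0.89, "u8": 0.88, "u1": 0.77, "u9": 0.70, "u5": 0.50}
evaluated, filter_value = phase_one_filter(d0_segments, known.get, k=4)
print(evaluated, filter_value)
```

Users that show up in many d=0 segments agree exactly with ua on many items, so they are likely to be strong candidates and give a tight initial filter.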

Algorithm

Phase 2: keep loading from the inverted lists
The best bound of the similarity decreases
Similarity bound is worse than the filter => pruned
The partial information becomes more complete
Update the filter after some number of segments are loaded
Stop when the number of remaining candidates is small

Algorithm – phase 2

In the implementation, the items that ua rated at the extremes (close to 1 or 5) are loaded first

The candidates’ best bounds drop faster

Similarity measure

Additive L1

Segmental Manhattan Distance (SMD) = Manhattan distance / number of relevant items

Sim = 1 - SMD / (maximum distance)

Horting

To ensure the intersection of items is large enough

Aggarwal et al.

Horting

     i1  i2  i3  i4  i5  ia
ua    1   2   3   4   5   ?
u1    1   2   3   4   5   5
u2    1   -   -   -   -   1

Sim(ua, u1) = Sim(ua, u2)

u2 is less reliable
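The similarity measure and the horting check can be illustrated on this example. The maximum distance of 4 (ratings from 1 to 5) and the horting factor of 3 are assumptions for the demo (the experiment later uses h = 10); the function names are hypothetical.

```python
def smd_similarity(profile_a, profile_i, max_distance=4):
    """Return (similarity, overlap) where
    similarity = 1 - (Manhattan distance / number of co-rated items) / max_distance."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return None, 0
    smd = sum(abs(profile_a[j] - profile_i[j]) for j in common) / len(common)
    return 1.0 - smd / max_distance, len(common)

def is_reliable(overlap, horting_factor):
    """A candidate counts only if it shares at least horting_factor items with ua."""
    return overlap >= horting_factor

ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 1, "ia": 1}

sim1, overlap1 = smd_similarity(ua, u1)
sim2, overlap2 = smd_similarity(ua, u2)
print(sim1, overlap1, is_reliable(overlap1, horting_factor=3))
print(sim2, overlap2, is_reliable(overlap2, horting_factor=3))
```

Both candidates get the same similarity, but u2's score rests on a single shared item, so the horting check rejects it.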

Best bound

We have ‘user num of appearance’
‘max num of more appearance’ = min(ua_profile.len, ui_profile.len) - ‘user num of appearance’

if the user has never been seen in any segment
    best distance = 1
else if (partial distance > 1)
    the user appears in unseen items, with d=1
else if (‘max num of more appearance’ < horting_factor)
    the user appears only enough times to satisfy the horting factor
else
    the user does not appear any more; the partial distance is the best

No random access

The inverted file is a list ranked on d (distance to ua’s rating)

Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006

Phase 1
Do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound

Worst Bound

While a user’s partial distance is smaller than the maximum possible distance, include the distance

No random access

Phase 2: find actual similarity and prune candidates

Experiment

Netflix dataset
480189 users
17770 movies
100 million ratings (1.17% of the matrix is filled)

k = 50, h = 10

Efficiency

Brute force: 185.24 s per query
Hybrid: 25.85 s per query
NRA: 59.34 s per query
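For a quick comparison, the relative speedups implied by these timings can be computed directly. The figures are the ones reported above; the dictionary keys are just labels.

```python
# Seconds per query, as reported in the experiment.
timings = {"brute force": 185.24, "hybrid": 25.85, "NRA": 59.34}

baseline = timings["brute force"]
speedups = {name: baseline / seconds for name, seconds in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x relative to brute force")
```

The hybrid method comes out roughly seven times faster than brute force, and the no-random-access variant roughly three times faster.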

Disk IO statistic (hybrid)

% of actual similarity: 7.60%

% of entries loaded from inverted file: 68.52%

% of entries which are loaded and relevant: 49.77%

Reference

Breese et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering

Coster & Svensson 2002. Inverted File Search Algorithms for Collaborative Filtering

Lemire & Maclachlan 2005. Slope One Predictors for Online Rating-Based Collaborative Filtering

Sarwar et al. 2001. Item-Based Collaborative Filtering Recommendation Algorithms

Aggarwal et al. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering

Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006. Efficient Aggregation of Ranked Inputs