Toward a New Protocol to Evaluate Recommender Systems


Description

Presentation given at the RecSys 2012 Workshop on Recommendation Utility Evaluation, Dublin, 2012. Contact: [email protected].

Transcript of Toward a New Protocol to Evaluate Recommender Systems

Page 1: Toward a new Protocol to evaluate Recommender Systems

Toward a New Protocol to Evaluate Recommender Systems

Frank Meyer, Françoise Fessant, Fabrice Clerot, Eric Gaussier

[email protected]

University Joseph Fourier & Orange

RecSys 2012 – Workshop on Recommendation Utility Evaluation

2012 – v1.18

Page 2: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 3: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 4: Toward a new Protocol to evaluate Recommender Systems

Recommender systems

For industrial applications (Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM), ...) as well as for well-known academic systems (Fab, More, Twittomender, ...), recommendation is multi-faceted: pushing items, sorting items, linking items... It cannot be reduced to predicting a score of interest of a user u for an item i.

What is a good recommender system? Just a system accurate at rating prediction for the top N blockbusters and the top M big users? ... or something else?

Page 5: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 6: Toward a new Protocol to evaluate Recommender Systems

Industrial point of view

Main goals of automatic recommendation:

• to increase sales
• to increase the audience (click rates...)
• to increase customers' satisfaction and loyalty

Main needs (analysis at Orange: TV, Video on Demand, shows, web radios, ...):

1. Helping all the users: big users and small users
2. Recommending all the items: frequently purchased/viewed items, rarely purchased/viewed items
3. Helping users with different identified problems:
   1. Should I take this item?
   2. Should I take this item or that one?
   3. What should interest me in this catalog?
   4. What is similar to this item?

Page 7: Toward a new Protocol to evaluate Recommender Systems


We propose 4 key functions

Help to Decide
Given a user u and an item i, give a predictive score of interest of u for i (a rating).

Help to Compare
Given a user u and a list of items i1, ..., in, sort the items in decreasing order of the score of interest for u.

Help to Discover
Given a user u, give N interesting items for u.

Help to Explore (navigate)
Given an item i used as a context, give N items similar to i.
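The deck stops at these prose definitions; as a minimal sketch, the four functions could surface in code as the following interface (all names and signatures here are ours, not from the paper):

```python
from typing import Protocol, Sequence

class Recommender(Protocol):
    """Illustrative interface for the 4 key functions.
    (typing.Protocol is Python's structural typing; it is unrelated
    to the evaluation protocol discussed in this talk.)"""

    def decide(self, user: int, item: int) -> float:
        """Help to Decide: predictive score of interest of `user` for `item`."""
        ...

    def compare(self, user: int, items: Sequence[int]) -> list[int]:
        """Help to Compare: `items` sorted by decreasing predicted interest."""
        ...

    def discover(self, user: int, n: int) -> list[int]:
        """Help to Discover: N interesting items for `user`."""
        ...

    def explore(self, item: int, n: int) -> list[int]:
        """Help to Explore: N items similar to the context `item`."""
        ...
```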

Page 8: Toward a new Protocol to evaluate Recommender Systems

Decide / Compare / Discover / Explore: quality criteria and measures

Function: Decide
Quality criterion: the rating prediction must be precise. Extreme errors must be penalized because they may more often lead to a wrong decision.
Measure: existing measure, RMSE.

Function: Compare
Quality criterion: the ranking prediction must be good for any pair of items of the catalog (not only for a top N).
Measure: existing measure, NDPM (or number of compatible orders).

Function: Discover
Quality criterion: the recommendation must be useful. Problem: if one recommends only well-known blockbusters (e.g. Star Wars, Titanic...), one will be precise but not useful!
Measure: existing measure, Precision; we introduce the Impact measure.

Function: Explore
Quality criterion: problem: the semantic relevance cannot be evaluated without user feedback.
Measure: we introduce a validation method for a similarity measure.

Page 9: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 10: Toward a new Protocol to evaluate Recommender Systems

Known vs. Unknown, Risky vs. Safe

Help to Discover: recommending an item to a user can be placed on two axes: the probability that the user already knows the item, and the probability that the user likes the item.

• Very bad recommendation: the user does not know the item and dislikes it; if he trusts the system, he will be misled.
• Bad recommendation: the user dislikes the item, but the item is generally known by name by the user.
• Trivial recommendation: the user likes the item but already knows it; correct, but not often useful.
• Very good recommendation: the user likes the item and does not know it yet.

Page 11: Toward a new Protocol to evaluate Recommender Systems

Measuring the Help to Discover

Average Measure of Impact (AMI)

Recommendation impact                 Impact if the user dislikes the item   Impact if the user likes the item
Recommending a popular item           slightly negative                      slightly positive
Recommending a rare, unknown item     strongly negative                      strongly positive

The pictured formula combines: the list Z of recommended items, the list H of logs (u, i, r) in the Test Set, the size of the catalog (normalization), and the impact of an item, defined as the rarity of the item times the relative rating of the user u (with respect to her mean of ratings). Rarity stands for the probability that the user already knows the item; the relative rating stands for the probability that the user likes it.
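The formula itself is an image on the slide; only its annotations survive in the transcript. A hedged reconstruction consistent with those annotations (with $Z_u$ the list of items recommended to $u$, $H$ the $(u,i,r)$ logs of the test set, $|C|$ the catalog size, $\mathrm{freq}(i)$ the number of ratings of item $i$, and $\bar r_u$ the mean of $u$'s ratings) could read:

$$\mathrm{AMI} \;=\; \frac{1}{|H_Z|}\sum_{(u,i,r)\in H_Z}\underbrace{\frac{|C|}{\mathrm{freq}(i)}}_{\text{rarity of }i}\cdot\underbrace{\bigl(r-\bar r_u\bigr)}_{\text{relative rating}},\qquad H_Z=\{(u,i,r)\in H : i\in Z_u\}$$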

Page 12: Toward a new Protocol to evaluate Recommender Systems

Principle of the protocol

LOGS (userID, itemID, rating) are split into a Learn set, used to build the Model, and a Test set. Three evaluations are then run (a sketch of this loop follows below):

• For each (userID, itemID) in Test: generate a rating prediction and compare it with the true rating (measure: RMSE).
• For each list of itemIDs of each userID in Test: sort the list according to the ratings, and compare the strict orders of the ratings with the order given by the model (measure: %COMP, % compatible).
• For each userID in Test: generate a list of recommended items; for each of these items actually rated by the userID in Test, evaluate the relevance (measure: AMI).

Datasets used: MovieLens 1M and Netflix. No long-tail distribution was detected in the Netflix dataset nor in the MovieLens dataset, so we use the simplest segmentation, based on the mean number of ratings: light/heavy users, popular/unpopular items (simple mean-based item/user segmentation).
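Putting the three measures together, a compact sketch of the evaluation loop. The model methods `predict`, `recommend` and `item_frequency` are assumptions for illustration, not from the paper, and the AMI term follows the hedged reconstruction above:

```python
import math
from collections import defaultdict

def evaluate(model, test_logs, catalog_size, top_n=10):
    """Offline protocol sketch: RMSE, %COMP and AMI on the test logs.
    test_logs is a list of (user_id, item_id, rating) triples."""
    # --- Help to Decide: RMSE over all test triples
    sq_err = [(model.predict(u, i) - r) ** 2 for u, i, r in test_logs]
    rmse = math.sqrt(sum(sq_err) / len(sq_err))

    # group the test ratings per user
    by_user = defaultdict(dict)
    for u, i, r in test_logs:
        by_user[u][i] = r

    # --- Help to Compare: fraction of strictly ordered test pairs
    # whose order the model reproduces (quadratic per user; fine for a sketch)
    ok = total = 0
    for u, ratings in by_user.items():
        items = list(ratings)
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                i, j = items[a], items[b]
                if ratings[i] == ratings[j]:
                    continue  # only strict orders count
                total += 1
                if (model.predict(u, i) - model.predict(u, j)) * \
                   (ratings[i] - ratings[j]) > 0:
                    ok += 1
    pct_comp = ok / total if total else float("nan")

    # --- Help to Discover: average impact of the recommended items
    # that the user actually rated in the test set
    impacts = []
    for u, ratings in by_user.items():
        mean_r = sum(ratings.values()) / len(ratings)
        for i in model.recommend(u, top_n):
            if i in ratings:
                rarity = catalog_size / model.item_frequency(i)
                impacts.append(rarity * (ratings[i] - mean_r))
    ami = sum(impacts) / len(impacts) if impacts else float("nan")
    return rmse, pct_comp, ami
```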

Page 13: Toward a new Protocol to evaluate Recommender Systems

We will use 4 algorithms to validate the protocol

Uniform Random Predictor
Returns a rating between 1 and 5 (min and max) with a uniform random distribution.

Default Predictor: (item mean + user mean) / 2
Robust mean of the item: requires at least 10 ratings on the item; otherwise only the user's mean is used.

K-Nearest-Neighbor item method
Uses K nearest neighbors per item, a scoring method detailed in the reference below, and a similarity measure called Weighted Pearson. Uses the Default Predictor when an item cannot be predicted.
• Ref: Candillier, L., Meyer, F., Fessant, F.: Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.

Fast factorization method
Fast factorization algorithm with F factors, known as Gravity ("BRISMF" implementation).
• Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D.: Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).
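The Default Predictor is simple enough to sketch directly. A minimal Python version, assuming (user, item, rating) training triples and using the 10-rating robustness threshold from the slide (the 3.0 cold-start fallback is our addition):

```python
from collections import defaultdict

class DefaultPredictor:
    """(item mean + user mean) / 2, falling back to the user mean alone
    when the item has fewer than 10 ratings (robust item mean)."""

    MIN_ITEM_RATINGS = 10

    def __init__(self, train_logs):
        # train_logs: iterable of (user_id, item_id, rating)
        u_sum, u_cnt = defaultdict(float), defaultdict(int)
        i_sum, i_cnt = defaultdict(float), defaultdict(int)
        for u, i, r in train_logs:
            u_sum[u] += r
            u_cnt[u] += 1
            i_sum[i] += r
            i_cnt[i] += 1
        self.user_mean = {u: u_sum[u] / u_cnt[u] for u in u_cnt}
        self.item_mean = {i: i_sum[i] / i_cnt[i] for i in i_cnt}
        self.item_count = dict(i_cnt)

    def predict(self, user, item):
        # fallback for an unseen user: midpoint of the 1..5 scale (our choice)
        um = self.user_mean.get(user, 3.0)
        if self.item_count.get(item, 0) >= self.MIN_ITEM_RATINGS:
            return (self.item_mean[item] + um) / 2
        return um  # item mean not robust enough: use the user's mean only
```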

Page 14: Toward a new Protocol to evaluate Recommender Systems

What about "Help to Explore"?

How can we compare the "semantic quality" of the link between 2 items?

Principle
• Define a similarity measure that can be extracted from the model.
• Use the similarity measure to build an item-item similarity matrix.
• Use the similarity matrix as the model of a KNN item-item recommender system.
• If this system obtains good performances for RMSE, %COMP and AMI, then the semantic quality of the similarity measure must be good.

Application (see the sketch after this list)
• For a KNN-item model this is immediate (there is an intrinsic similarity).
• For a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items' factors.
• For a random rating predictor, this is not applicable...
• For a mean-based rating predictor, this is not applicable...
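For the factorization case, a minimal numpy sketch of the "Pearson on the items' factors" idea (the function name and array shapes are ours):

```python
import numpy as np

def knn_from_item_factors(item_factors: np.ndarray, k: int = 100):
    """item_factors: (n_items, n_factors) matrix, e.g. from Gravity/BRISMF.
    Returns, for each item, the indices of its k most similar items under a
    Pearson correlation computed on the factor vectors."""
    # Pearson between two vectors = cosine of the mean-centered vectors
    X = item_factors - item_factors.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    sim = X @ X.T                    # (n_items, n_items) similarity matrix
    np.fill_diagonal(sim, -np.inf)   # exclude the item itself
    # indices of the k largest similarities per row
    return np.argsort(-sim, axis=1)[:, :k]
```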

Page 15: Toward a new Protocol to evaluate Recommender Systems

Evaluating "Help to Explore" for Gravity

[Figure: pipeline.] The items × users matrix of ratings (rows of items, columns of users) is factorized by Gravity (fast matrix factorization) into a matrix of items' factors and a matrix of users' factors (not used here). Items' similarity computations and a K-nearest-neighbors search on the matrix of items' factors produce a similarity matrix (KNN) of the items, which serves as the model of a KNN-based recommender system. This enables a possible evaluation of the quality of the similarity matrix via RMSE, %COMP, AMI...

Page 16: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 17: Toward a new Protocol to evaluate Recommender Systems

Finding 1: different performances according to the segments

[Two plots: "RMSE for KNN on Netflix" (RMSE vs. number of KNN) and "RMSE for Gravity on Netflix" (RMSE vs. number of factors). Each shows the global RMSE, the average RMSE of the Default Predictor, and the RMSE on the 4 segments analyzed: heavy users / popular items (Huser Pitem), light users / popular items (Luser Pitem), heavy users / unpopular items (Huser Uitem), light users / unpopular items (Luser Uitem).]

We have a decrease in performance of more than 25% between the heavy-user/popular-item segment and the light-user/unpopular-item segment.

Page 18: Toward a new Protocol to evaluate Recommender Systems

Finding 2: RMSE not strictly linked to the other performances

Example on 2 segments...

[Two plots for Gravity on Netflix, as a function of the number of factors: "Ranking compatibility for Gravity - Netflix" (%compatible) and "RMSE for Gravity on Netflix", each showing the global curve, the Default Predictor, and the 4 user/item segments (Huser/Luser × Pitem/Uitem).]

The light-user/popular-item segment is easier to optimize than the light-user/unpopular-item segment for RMSE, but is as difficult to optimize as the light-user/unpopular-item segment for ranking.

Page 19: Toward a new Protocol to evaluate Recommender Systems

Finding 2 (continued): RMSE not strictly linked to the other performances

[Two plots on Netflix: "Average Measure of Impact - Netflix" for Random Pred, Default Pred, KNN (K=100) and Gravity (F=32); and "RMSE for KNN on Netflix" as a function of the number of KNN, with the global curve, the Default Predictor, and the 4 user/item segments.]

Globally, Gravity is better than KNN for RMSE, but worse than KNN for the Average Measure of Impact.

Page 20: Toward a new Protocol to evaluate Recommender Systems

Global results

Help to Decide / Compare / Discover:

• Gravity dominates for the RMSE measure.
• KNN dominates on the heavy-user segments.
• The Default Predictor is very useful on the unpopular (i.e. infrequent) item segments.

Page 21: Toward a new Protocol to evaluate Recommender Systems

Comparing native similarities with Gravity-based similarities

Similarities are measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):

1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the "Help to Explore" function.

                                   Native KNN      KNN computed on Gravity's item
                                   (K=100)         factors (K=100, 16 factors)
RMSE                               0.8440          0.8691
Ranking: % compatible              77.03%          75.67%
Precision                          91.90%          86.39%
AMI                                2.043           2.025
Global time of the modeling task   5290 seconds    3758 seconds

Page 22: Toward a new Protocol to evaluate Recommender Systems


Summary

Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

Conclusion and future work

Page 23: Toward a new Protocol to evaluate Recommender Systems

Conclusion: contributions

As industrial recommendation is multi-faceted, we proposed to list the key functions of recommendation:

• Help to Decide, Help to Compare, Help to Discover, Help to Explore
• Note on Help to Explore: the similarity feature is mandatory for a recommender system

We proposed to define a dual segmentation of items and users:

• just being very accurate on big users and blockbuster items is not very useful

For a new offline protocol to evaluate recommender systems, we proposed to use the recommender's key functions with the dual segmentation:

• mapping the key functions to measures
• adding the measure of Impact to evaluate the "Help to Discover" function
• adding a method to evaluate the "Help to Explore" function

We demonstrated its utility:

• RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system only with RMSE (no guarantee on the other measures!)
• The mapping of the best algorithm for each (function, segment) pair could be exploited to improve the global performance
• We also saw empirically that the KNN approach can be virtualized, computing the similarities between items on a factorized space built, for instance, by Gravity

Page 24: Toward a new Protocol to evaluate Recommender Systems

Future work: 3 main axes

1. Evaluation of the quality of the 4 core functions using an online A/B testing protocol
2. Hybrid switch system: the best algorithm for the adapted task according to the user/item segment
3. KNN virtualization via matrix factorization

Page 25: Toward a new Protocol to evaluate Recommender Systems


Annexes


Page 26: Toward a new Protocol to evaluate Recommender Systems

About this work...

Frank Meyer: Recommender Systems in Industrial Contexts. CoRR abs/1203.4487 (2012).

Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012, Dublin.

Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505, Lyon.

Page 27: Toward a new Protocol to evaluate Recommender Systems

Classic mathematical representation of the recommendation problem

[Figure: an items × users rating matrix with thousands of users (columns u1, u2, ..., ul, ..., un) and thousands of items (rows i1, i2, ..., ik, ..., im); cells hold known ratings of interest, and "?" marks the ratings of interest to predict.]
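In equation form, the pictured matrix is the classic partially observed rating matrix (the entry values below are illustrative):

$$R = \begin{pmatrix}
4 & 2 & \cdots & ? & \cdots & 1\\
4 & 5 & \cdots & 4 & \cdots & 5\\
\vdots & & \ddots & & & \vdots\\
3 & ? & \cdots & 4 & \cdots & 5\\
\vdots & & & & \ddots & \vdots\\
5 & ? & \cdots & 2 & \cdots & 4
\end{pmatrix}$$

with rows $i_1,\dots,i_m$ (thousands of items) and columns $u_1,\dots,u_n$ (thousands of users); numeric entries are known ratings of interest, and $?$ marks the ratings to predict.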

Page 28: Toward a new Protocol to evaluate Recommender Systems

Well-known industrial example: item-to-items recommendation (Amazon™)

Page 29: Toward a new Protocol to evaluate Recommender Systems

Multi-faceted analysis: measures

[Figure: the formulas of the four measures, with these annotations.]

RMSE: uses the predicted rating, the real rating, and the number of logs in the Test Set.

NDPM: uses, on a same dataset and a same user, the number of contradictory orders, the number of compatible orders, and the number of strict orders given by the user; the % compatible is directly usable.

Precision: uses the number of recommended items actually evaluable in the Test Set.

AMI: Average Measure of Impact.
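The formulas themselves are images on the slide; standard definitions consistent with the annotations (NDPM after Yao, 1995; hedged reconstructions, not copied from the deck) are, with $T$ the test logs and $\hat r_{ui}$ the predicted rating:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i,r)\in T}\bigl(\hat r_{ui} - r\bigr)^{2}}$$

$$\mathrm{NDPM} = \frac{2\,C^{-} + C^{u}}{2\,C}, \qquad
\mathrm{Precision} = \frac{\#\{\text{recommended items liked in the Test Set}\}}{\#\{\text{recommended items evaluable in the Test Set}\}}$$

where $C$ is the number of item pairs strictly ordered by the user, $C^{-}$ the number of those pairs the system orders contradictorily, and $C^{u}$ the number it leaves tied; the % compatible counts the pairs ordered compatibly.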

Page 30: Toward a new Protocol to evaluate Recommender Systems

Comparing native similarities with Gravity-based similarities

Similarities are measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):

• Gravity can be used for the "Help to Explore" function.
• KNN item-item can be performed on a factorized matrix with little performance loss.

Page 31: Toward a new Protocol to evaluate Recommender Systems

Reperio C-V5, centralized mode: example of a movie recommender

Page 32: Toward a new Protocol to evaluate Recommender Systems

Reperio E-V2, embedded mode: example of a TV program recommender