
Transfer Learning for Collective Link Prediction in Multiple Heterogeneous Domains

Cao et al., ICML 2010. Presented by Danushka Bollegala.

Link Prediction

Predict links (relations) between entities:
▪ Recommend items to users (MovieLens, Amazon)
▪ Recommend users to users (social recommendation)
▪ Similarity search (suggest similar web pages)
▪ Query suggestion (suggest related queries issued by other users)

Collective Link Prediction (CLP)
▪ Perform multiple prediction tasks for the same set of users simultaneously
▪ Predict/recommend multiple item types (books and movies)

Pros
▪ Prediction tasks might not be independent; one task can benefit from another (books vs. movies vs. food)
▪ Less affected by data sparseness (cold-start problem)

Transfer Learning + Collective Link Prediction (this paper)

Building blocks:
▪ Gaussian Process Regression (GPR) (PRML Sec. 6.4)
▪ Link prediction = matrix factorization
▪ Probabilistic Principal Component Analysis (PPCA) (Tipping & Bishop, 1999; PRML Chapter 12)
▪ Probabilistic non-linear matrix factorization (Lawrence & Urtasun, ICML 2009)
▪ Task similarity matrix, T

Link Modeling via NMF

Link matrix X (x_ij is the rating given by user i to item j)

x_ij is modeled by f(u_i, v_j, ε)
▪ f: link function
▪ u_i: latent representation of user i
▪ v_j: latent representation of item j
▪ ε: noise term

Generalized matrix approximation
▪ Assumption: ε is Gaussian noise, N(0, σ²I)
▪ Use Y = f⁻¹(X); then Y follows a multivariate Gaussian distribution.
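To make the Y = f⁻¹(X) step concrete, here is a minimal numpy sketch (my own toy example, not the paper's code); the exponential link f(y) = exp(y) is a hypothetical choice for illustration.

```python
import numpy as np

# Minimal sketch of the generalized matrix approximation. A hypothetical
# exponential link f(y) = exp(y) is used; the paper learns its link.
rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 5

U = rng.normal(size=(n_users, k))   # latent user representations u_i
V = rng.normal(size=(n_items, k))   # latent item representations v_j
noise = rng.normal(scale=0.1, size=(n_users, n_items))  # ε ~ N(0, σ²I)

Y = U @ V.T + noise        # Y is (low-rank) Gaussian
X = np.exp(Y)              # observed links X = f(Y)

# Inverting the link recovers the Gaussian-distributed Y = f⁻¹(X)
Y_rec = np.log(X)
print(np.allclose(Y, Y_rec))  # True
```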

Gaussian Process Regression

Revision (PRML Section 6.4)

Functions as Vectors

▪ We can view a function f as an infinite-dimensional vector: f = (f(x_1), f(x_2), ...)ᵀ
▪ Each point in the domain is mapped by f to a dimension in the vector
▪ In machine learning we must find functions (e.g., linear predictors) that map input values to their corresponding output values, while avoiding over-fitting
▪ This can be visualized as sampling from a distribution over functions with certain properties: a preference bias (cf. restriction bias)

Gaussian Process (GP) (1/2)

Linear regression model: y(x) = wᵀφ(x)
▪ We get different output functions y for different weight vectors w
▪ Let us impose a Gaussian prior over w: p(w) = N(w | 0, α⁻¹I)

Training dataset: {(x_1, y_1), ..., (x_N, y_N)}; targets y = (y_1, ..., y_N)ᵀ
Design matrix Φ, with rows φ(x_n)ᵀ, so that y = Φw

Gaussian Process (2/2)

When we impose a Gaussian prior over the weight vector, the target vector y = Φw is also Gaussian, with E[y] = 0 and cov[y] = α⁻¹ΦΦᵀ = K.

K: kernel (Gram) matrix, K_nm = k(x_n, x_m)
k: kernel function, k(x, x′) = α⁻¹φ(x)ᵀφ(x′)
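A quick empirical check of this fact (my own sketch, using a hypothetical polynomial design matrix): sampling w from its Gaussian prior and propagating through Φ reproduces the covariance K = α⁻¹ΦΦᵀ.

```python
import numpy as np

# A Gaussian prior on w induces a Gaussian distribution over y = Φw
# with covariance K = α⁻¹ΦΦᵀ.
rng = np.random.default_rng(0)
alpha = 2.0                       # prior precision on w
x = np.linspace(0, 1, 5)
Phi = np.vander(x, 4)             # toy polynomial design matrix Φ

K = Phi @ Phi.T / alpha           # kernel/Gram matrix K = α⁻¹ΦΦᵀ

# Sample many w ~ N(0, α⁻¹I) and compare the empirical cov[y] to K
W = rng.normal(scale=alpha ** -0.5, size=(4, 100_000))
Y = Phi @ W
print(np.allclose(np.cov(Y), K, atol=0.05))  # ≈ True
```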

Gaussian Process: Definition

A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values y(x_1), ..., y(x_N) evaluated at an arbitrary set of points x_1, ..., x_N jointly have a Gaussian distribution.
▪ p(y(x_1), ..., y(x_N)) is Gaussian
▪ The mean is often set to zero (non-informative prior); then the kernel function fully defines the GP

Gaussian kernel: k(x, x′) = exp(−‖x − x′‖² / 2σ²)
Exponential kernel: k(x, x′) = exp(−θ|x − x′|)
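To visualize "a distribution over functions", here is a small sketch (mine, not from the slides) that draws sample functions from a zero-mean GP prior with the Gaussian kernel above.

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with the RBF kernel.
rng = np.random.default_rng(0)

def gaussian_kernel(x, xp, sigma=0.3):
    """k(x, x') = exp(-||x - x'||² / (2σ²)) on 1-D inputs."""
    return np.exp(-((x[:, None] - xp[None, :]) ** 2) / (2 * sigma ** 2))

x = np.linspace(0, 1, 100)
K = gaussian_kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for stability

# Each sample is a "function viewed as a vector": one value per input point
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)  # (3, 100): three sampled functions on the grid
```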

Gaussian Process Regression (GPR)

Predict outputs with noise: the observed target t is the noise-free output y(x) plus Gaussian noise ε, i.e., t = y(x) + ε.
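A minimal GPR prediction sketch (my own, following PRML Sec. 6.4): with C_N = K + β⁻¹I, the predictive mean at the test points is k*ᵀC_N⁻¹t and the predictive covariance is K** − k*ᵀC_N⁻¹k*.

```python
import numpy as np

# GPR predictive mean/variance at test points Xs, given noisy targets t.
rng = np.random.default_rng(0)

def rbf(a, b, sigma=0.3):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

beta_inv = 0.01                         # noise variance β⁻¹
X = rng.uniform(0, 1, 20)               # training inputs
t = np.sin(2 * np.pi * X) + rng.normal(scale=0.1, size=20)
Xs = np.linspace(0, 1, 50)              # test inputs

C = rbf(X, X) + beta_inv * np.eye(20)   # C_N = K + β⁻¹I
k = rbf(X, Xs)                          # train/test cross-covariances

mean = k.T @ np.linalg.solve(C, t)                 # predictive mean
var = rbf(Xs, Xs) - k.T @ np.linalg.solve(C, k)    # predictive covariance
print(mean.shape, np.diag(var).min() >= -1e-9)     # (50,) True
```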

Probabilistic Matrix Factorization

PMF can be seen as a Gaussian process with latent variables (GP-LVM) [Lawrence & Urtasun, ICML 2009]
▪ Generalized matrix approximation model
▪ Y = f⁻¹(X) follows a multivariate Gaussian distribution
▪ A Gaussian prior is set on U; this yields the probabilistic PCA model of Tipping & Bishop (1999)
▪ Non-linear version: replace the linear kernel with a non-linear one
▪ Mapping back to X: ratings are not Gaussian!
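A sketch (mine) of the PPCA view: with a standard Gaussian prior on the user factors U marginalized out, each user's rating vector is Gaussian with covariance VVᵀ + σ²I; swapping this linear kernel for a non-linear one gives the GP-LVM version.

```python
import numpy as np

# With U ~ N(0, I) marginalized out, each row of Y = UVᵀ + ε is Gaussian
# with covariance VVᵀ + σ²I — the probabilistic PCA marginal.
rng = np.random.default_rng(0)
n_users, n_items, k, sigma = 100_000, 6, 2, 0.1

V = rng.normal(size=(n_items, k))
U = rng.normal(size=(n_users, k))
Y = U @ V.T + rng.normal(scale=sigma, size=(n_users, n_items))

# Empirical covariance across users vs. the PPCA marginal VVᵀ + σ²I
print(np.allclose(np.cov(Y.T), V @ V.T + sigma**2 * np.eye(n_items),
                  atol=0.05))  # ≈ True
```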

Collective Link Prediction

▪ A GP model for each task vs. a single model for all tasks
▪ The single model couples the per-task kernels through the task similarity matrix T

Tensor Product ⊗
▪ Known as the Kronecker product for two matrices (e.g., numpy.kron(a, b))
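A sketch (mine) of how a Kronecker product combines a task similarity matrix with a data kernel into one joint covariance, in the spirit of the collective model:

```python
import numpy as np

# Multi-task GP covariance as the Kronecker product of a task
# similarity matrix T and a data kernel matrix K.
T = np.array([[1.0, 0.7],
              [0.7, 1.0]])        # 2 tasks: e.g., books vs. movies
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])   # kernel over 3 users

K_multi = np.kron(T, K)           # (2·3) × (2·3) joint covariance
print(K_multi.shape)              # (6, 6)
# Entry ((s,i),(t,j)) equals T[s,t] * K[i,j]: similar tasks share more.
```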

Generalized Link Functions

Each task might have a different rating distribution.

c, α, b are parameters that must be estimated from the data.

The constraint α > 0 can be relaxed if we have no prior knowledge about whether the skewness of the rating distribution is negative.
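The slide's formula for the link did not survive transcription; as a stand-in, here is a hypothetical monotone link with parameters (c, α, b), where α controls skewness. This is an illustration only, not the paper's exact function.

```python
import numpy as np

# Hypothetical monotone link for illustration only (not the paper's
# exact form): g(y) = b + c · sign(y) · |y|**alpha.
# alpha skews the Gaussian latent y; c scales and b shifts it.
def link(y, c=1.0, alpha=1.5, b=3.0):
    return b + c * np.sign(y) * np.abs(y) ** alpha

def inverse_link(x, c=1.0, alpha=1.5, b=3.0):
    z = (x - b) / c
    return np.sign(z) * np.abs(z) ** (1.0 / alpha)

y = np.random.default_rng(0).normal(size=5)
print(np.allclose(y, inverse_link(link(y))))  # True: the link is invertible
```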

Predictive distribution

Similar to GPR prediction:
▪ first predict y = g(x) in the latent Gaussian space,
▪ then predict x by mapping back through the link function.

Parameter Estimation

▪ Compute the likelihood of the dataset
▪ Use stochastic gradient descent (SGD) for optimization
▪ Non-convex optimization: sensitive to initial conditions
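A toy sketch (mine) of gradient-based hyperparameter fitting: maximize a GP log marginal likelihood by plain gradient ascent, with a finite-difference gradient standing in for the paper's analytic SGD updates; because the objective is non-convex, the result depends on the initial value.

```python
import numpy as np

# Fit an RBF lengthscale by ascending the GP log marginal likelihood.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * X) + rng.normal(scale=0.1, size=30)

def log_marginal(log_sigma):
    s = np.exp(log_sigma)
    K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * s**2))
    C = K + 0.01 * np.eye(30)                 # K + noise variance
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (t @ np.linalg.solve(C, t) + logdet + 30 * np.log(2 * np.pi))

log_sigma, lr, eps = np.log(1.0), 0.01, 1e-5
for _ in range(500):
    # finite-difference gradient; clipped to keep the ascent stable
    g = (log_marginal(log_sigma + eps) - log_marginal(log_sigma - eps)) / (2 * eps)
    log_sigma += lr * np.clip(g, -10, 10)
print(np.exp(log_sigma))  # learnt lengthscale (depends on initialization)
```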

Experiments

Setting: use each dataset and predict multiple item types.

Datasets
▪ MovieLens: 100,000 ratings on a 1-5 scale; 943 users, 1,682 movies, 5 popular genres
▪ Book-Crossing: 56,148 ratings on a 1-10 scale; 28,503 users, 9,909 books, 4 most general Amazon book categories
▪ Douban: a social network-based recommendation service; 10,000 users, 200,000 items; movies, books, and music

Evaluation

Evaluation measure: Mean Absolute Error (MAE)

Baselines
▪ I-GP: independent link prediction using GP
▪ CMF: collective matrix factorization (non-GP, classical NMF)
▪ M-GP: joint link prediction using a multi-relational GP; does not consider the similarity between tasks

Proposed method: CLP-GP
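For reference, MAE is just the mean absolute deviation between predicted and true ratings (a trivial sketch, mine):

```python
import numpy as np

# Mean Absolute Error: lower is better.
def mae(pred, true):
    return np.mean(np.abs(np.asarray(pred) - np.asarray(true)))

print(mae([3.5, 4.0, 2.0], [4, 4, 1]))  # 0.5
```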

Results

Note: (1) smaller values are better; (2) (+) = with link function, (−) = without.

[Plots: MAE vs. total data sparseness and vs. target-task data sparseness; lower is better]

Task similarity matrix (T)

▪ Romance and Drama are very similar
▪ Action and Comedy are very dissimilar

My Comments

Pros:
▪ Elegant model and well-written paper
▪ Few parameters need to be specified (only the latent space dimension k); all other parameters can be learnt
▪ Applicable to a wide range of tasks

Cons:
▪ Computational complexity: predictions require kernel matrix inversion
▪ SGD updates might not converge; the problem is non-convex...