KDD'15 - Distributed Personalization


Aug 11, 2015. Xu Miao, Lijun Tang, Yitong Zhou, Joel Young (LinkedIn); Chun-te Chu (Microsoft); Anmol Bhasin (Groupon)

Distributed Personalization

Motivation / Distributed Learning / Personalization / Experiments

I will start with a motivating example, then divide the talk into two parts, distributed learning and personalization. I will discuss each in turn and demonstrate some experimental results.

Recommendation

I open the LinkedIn app, read this post, and find it very interesting.

Recommendation

I click Like. Or, if I feel it is irrelevant, I click Hide.

Recommendation

Now the question is: what is the next post we are going to present?

Common Solution
Apps → Tracking → ETL → DM → Delivering

This is a classic recommendation problem, and the common solution is to collect many users' feedback and send it through the tracking system and ETL to a data mining system, say, Hadoop. A smart data scientist then stares at the data and does some fancy modeling, eventually delivering a model that improves the recommendations.

Common Solution -- Cold Start
Apps (seconds) → Tracking (minutes) → ETL (hours) → DM → Delivering (days)

However, the problem with this cold-start modeling is that user interactions happen within seconds, while tracking results arrive within minutes. By the time the data is ETL'd, hours have passed; by the time the model is delivered, possibly days have gone by. Users' interests may have drifted away.

Common Solution -- Warm Start
Apps (seconds) → Tracking (minutes) → ETL (hours) → DM → Delivering (days)

To remedy this problem, a warm-start model can be built more frequently, say using the last hour's user feedback. This helps capture trending information and improves relevance a lot. However, it is still not optimal: you train the model on the last hour's active users but apply it to the active users of the next hour, and these two batches of users may not share the same interests. It is still not personalized enough.

Bring ML Closer to Users
Apps (seconds) → Tracking (minutes) → ETL (hours) → DM → Delivering (days)

So, why not push the warm-start learning closer to the users, say onto the client side, so it can react immediately to user feedback?

Distributed Online Learning
Definition:
- Agent presents an example (x, y)
- User responds with a reward r
- Agent updates the model w

This is commonly considered the setting of distributed online learning. The usual definition of online learning is the following: an intelligent agent presents an example (x, y), where x might be a post, a news item, or a job, and y is the relevance score computed by the current model w. The user responds with a reward r: +1 for Like, -1 for Hide. The agent then updates the model w immediately, and the loop continues.
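To make the loop concrete, here is a minimal sketch in Python. The linear scoring, the reward-weighted gradient step, and the get_user_reward callback are illustrative assumptions, not the system's actual update rule.

    import numpy as np

    def online_learning_loop(items, w, get_user_reward, lr=0.1):
        # items: candidate feature vectors x (posts, news, jobs)
        # get_user_reward: callback returning +1 (Like) or -1 (Hide)
        for x in items:
            y = float(w @ x)            # relevance score under current model w
            r = get_user_reward(x, y)   # user feedback as a reward
            w = w + lr * r * x          # immediate update, then the loop continues
        return w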

Distributed Online Learning
Definition:
- Agent presents an example
- User responds with a reward r
- Agent updates the model w
Challenges:
- User feedback data too few → Distributed Learning

The challenge here is that each user's feedback data is too sparse to train a reliable model individually. We need distributed learning to leverage everyone's knowledge quickly.

Distributed Online Learning
Definition:
- Agent presents an example
- User responds with a reward r
- Agent updates the models
Challenges:
- User feedback data too few → Distributed Learning
- Everyone has different preferences → Personalization

Everyone has different preferences, so we need to train personalized models. This means we will need to update many models simultaneously.

Motivation / Distributed Learning / Personalization / Experiments

Now let's look at distributed learning first and set personalization aside for the moment.

Distributed Gradient Descent
Bulk Synchronous Parallel (Hadoop & Spark): ~thousands of interactions to converge

A very popular distributed learning approach is gradient descent, but it is difficult to apply in our scenario. For example, bulk synchronous parallel requires thousands of interactions from each user to converge, which can be quite annoying.
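For reference, a minimal sketch of one bulk-synchronous round, under the assumption of a simple squared-loss model; every synchronized round consumes roughly one fresh interaction per user, which is why thousands of interactions are needed.

    import numpy as np

    def bsp_round(w, user_batches, lr=0.05):
        # Each worker computes a gradient on its user's latest feedback batch
        # (X, y); a barrier waits for all workers before the averaged step.
        grads = [X.T @ (X @ w - y) / len(y) for X, y in user_batches]
        return w - lr * np.mean(grads, axis=0)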

Distributed Gradient Descent
Stale Synchronous Parallel [Ho et al. '13]: for some users, staleness is forever


A popular method like Stale Synchronous Parallel allows bounded asynchrony and significantly reduces the number of interactions per user when many users are online simultaneously. However, user behaviors differ: some users are very active and click frequently, while some are passive and seldom click. SSP requires the fastest worker to wait for the slowest one if it gets too far ahead. In an online setting we cannot do this, since some users might stay stale forever. We need the asynchrony to be unbounded.

Learning Rate
- Blessing: it is one of the key reasons for PGDs to converge fast
- Challenge: it keeps diminishing, so data that arrives later has smaller and smaller impact
- Restart? Keep a residual constant? Hard to manage

The biggest practical difficulty is the decaying learning rate. If a user comes late into the game, his contribution to the system will be very small. And if we restart the learning periodically, all the optimality properties become irrelevant, because the restart procedure incurs big overheads.
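To see why late contributions shrink, consider the common 1/sqrt(t) schedule (an illustrative assumption; the exact schedule varies):

$$ w_{t+1} = w_t - \eta_t\, g_t, \qquad \eta_t = \frac{\eta_0}{\sqrt{t}} $$

A user whose first update lands at step t = 10^6 moves the model by about 1/1000 of what an update at t = 1 would, so latecomers are effectively muted.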

Alternating Direction Method of Multipliers (ADMMs)

That is why we looked into another popular optimization technique, ADMMs.

ADMMs -- Bulk Synchronous Parallel

ADMMs are very easy to implement in a synchronous fashion. Say these orange cross nodes are client machines: we compute the individual models and their dual variables on the client machines.

ADMMs -- Bulk Synchronous Parallel

The blue plus node represents a server machine: we merge the consensus model on the server node after receiving all users' models.
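Concretely, this is the standard consensus-ADMM split (in my notation, following the textbook formulation rather than the slides): each client i updates its local model w_i and dual variable u_i, and the server averages them into the consensus z.

$$
\begin{aligned}
w_i^{k+1} &= \arg\min_{w_i}\; f_i(w_i) + \tfrac{\rho}{2}\,\lVert w_i - z^k + u_i^k\rVert^2 \\
z^{k+1} &= \tfrac{1}{N}\sum_{i=1}^{N}\bigl(w_i^{k+1} + u_i^k\bigr) \\
u_i^{k+1} &= u_i^k + w_i^{k+1} - z^{k+1}
\end{aligned}
$$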

ADMMs -- Asynchronous Parallel [Miao, Chu, Tang, Zhou, Young, Bhasin '15]

[Figure: timeline; master versions V1, V1′, V1″ published at t0, t1, t2]

The question is how to make this completely asynchronous. The strategy we adopt is very simple, like a version control system. For example, at t0, user 1 and user 2 pull the model V1 at the same time and start their model adaptations individually. At t1, user 1 finishes first and pushes its model back; the server merges the change and publishes V1′. At t2, user 2 finishes. Because its model also originated from V1, we can merge it with V1′ easily and publish V1″. This is simple.

ADMMs -- Asynchronous Parallel [Miao, Chu, Tang, Zhou, Young, Bhasin '15]

[Figure: timeline; V1, V1′, V1″ at t0-t2, with user 3's pull and push at t3, t4 branching into V2]

Things get complicated when the versions branch out. For example, at t3, user 3 pulls model V1′ and pushes back at t4. Because its model comes from V1′, not from V1, we cannot merge it with V1″ directly. So we branch into V2.

ADMMs -- Asynchronous Parallel [Miao, Chu, Tang, Zhou, Young, Bhasin '15]

Weighted Merge
[Figure: timeline; branches V1″ and V2, with weight 1 each, merge into V3]

Now how do we merge V1″ and V2 together? We use a weighted average: because both versions carry one contribution since their common ancestor V1′, they are equally weighted.
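A minimal sketch of this merge rule in Python; weighting each branch by its number of contributions since the common ancestor follows the narration, while the function shape and names are my own illustration.

    import numpy as np

    def weighted_merge(z_a, n_a, z_b, n_b):
        # z_a, z_b: consensus models on the two branches
        # n_a, n_b: contributions accumulated since their common ancestor
        z = (n_a * np.asarray(z_a) + n_b * np.asarray(z_b)) / (n_a + n_b)
        return z, n_a + n_b  # merged model carries the combined count

With n_a = n_b = 1, as in the example above, the two branches are averaged with equal weight.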

ADMMs -- Asynchronous Parallel [Miao, Chu, Tang, Zhou, Young, Bhasin '15]

Master Versions
[Figure: timeline of published master versions]

In this way, the server side keeps publishing master versions, and the client side simply pulls the latest one and adapts. The whole system becomes asynchronous.

ADMMs -- Asynchronous Parallel [Miao, Chu, Tang, Zhou, Young, Bhasin '15]
- Same convergence rate as Bulk Synchronous Parallel
- No learning rate
- Out-of-order sequences of mini-optimizations
- Continuous Learning

It is easy to reduce this process to a multi-block ADMM program. Since multi-block ADMM converges linearly, our convergence rate remains the same as in the synchronous case. This means the number of interactions per user does not need to be large, as long as enough users are online simultaneously. It also relieves us from managing a learning rate and supports continuous learning.

Motivation / Distributed Learning / Personalization / Experiments

Now let's focus on how to use asynchronous ADMMs to solve the personalization problem.

Personalized Models

We use a regularization term that allows each individual model to diverge a little from the consensus model; gamma controls how much personalization is allowed.

Personalized Models

We then let each client hold a copy of the consensus model and require the local copy to equal the consensus model eventually. This turns the optimization into an ADMM problem.
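In symbols (notation reconstructed from the narration, not taken from the slides; the placement of gamma here is one common convention and the paper's may differ): each user i keeps a personal model w_i tied to a local copy z_i of the consensus by the gamma-regularizer, and the consensus constraint gives the ADMM form.

$$
\min_{\{w_i\},\,\{z_i\},\, z}\;\sum_{i=1}^{N}\Bigl( f_i(w_i) + \frac{\gamma}{2}\,\lVert w_i - z_i\rVert^2 \Bigr)
\quad \text{s.t.}\quad z_i = z,\; i = 1,\dots,N.
$$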

Personalized Models
The personalization strength:
- Allows divergence of personal models from the consensus model
- Improves relevance
- Improves convergence (speed)

Personalization improves relevance and convergence at the same time. Most of the iterations ADMMs spend on convergence go toward reaching agreement among the individual models; once we allow divergence, ADMMs converge much faster.

Motivation / Distributed Learning / Personalization / Experiments

Facial Expression Recognition

The first experiment we did is recognizing human facial expressions. We first align the face landmarks and compute features, then feed the feature vector into a classifier to decide whether a person is happy or not.
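As a point of reference, here is a minimal sketch of the classification step using scikit-learn's random forest, matching the baseline named on the next slide; landmark alignment and feature extraction are abstracted into X, and the random data is a stand-in for real landmark-derived features.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # X: landmark-derived feature vectors; y: 1 = happy, 0 = not happy.
    # Random placeholders stand in for the real aligned-landmark features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))
    y = rng.integers(0, 2, size=200)

    baseline = RandomForestClassifier(n_estimators=100, random_state=0)
    baseline.fit(X, y)
    print("training accuracy:", baseline.score(X, y))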

Facial Expression Recognition

The baseline is a random forest model (a stronger model than a linear model); the personalized model does much better.

Facial Expression Recognition

Accuracy broken down by personalization strength: gamma = 50 does the best.

The vertical lines represent the variance of the TPR at a given FPR point, to account for the