How to Retrain Recommender System?
A Sequential Meta-Learning Method
Yang Zhang, Fuli Feng, Chenxu Wang, Xiangnan He,
Meng Wang, Yan Li, Yongdong Zhang
University of Science and Technology of China , National University of Singapore
Hefei University of Technology, Beijing Kuaishou Technology Co., Ltd. Beijing, China
Outline
Introduction
Our solution
Experiments
Conclusion
Age of Information Explosion
Serious Issue of Information
Overloading
Weibo: >500M posts/day
Flickr:>300M images/day
Kuaishou: >20M micro-videos/day
… …
3
Xiangnan He. Recent Research on Graph Neural Networks for recommendation System
How does a RecSys Work?
Working :
With time goes by,Model may serve more and more bad. Because:
Collect/Clean
… Data
Offline training &
Saving ModelServe Online
interested interested interested
Young Girl Pregnant Woman New Mother
(1) User Interests drifts.
long- and short- term interest!
(2) New users/items are coming…
=> Solution: train the model again
with new collected data.(i.e. retrain)
Full retraining and Fine tuning
method 1 -- Full retraining
Use all previous and new collected to retrain model. (initialed by previous model)
pros: In some case, more data may reflect user interests more accurately
cons: Cost highly (memory and computation) ;
Overemphasis on previous data. (proportion of the last two datasets: t=1: 100% t=9: 20%)
Method 2 – Fine tuning:
In each period, only use new collected data to retrain/adjust previous model.
t=0 t=1 t=2
Pros: fast, low cost (memory, computation)
Cons: overfitting and forgetting issue (long-term interest)
Sample-based method
Method 3 – sample-based methods:
full-retraining: slow, high cost, ignore short-term interest
Fine tuning: Fast, forgetting long-term interest
trade-off : Sample previous data: long-term interest
New data: short-term interest
SPMF:
each period:
• Step 1: Use the sampled previous data and new data to
retrain model
• Step 2: Update the Reservoir (With some P)
Pros: A trade-off between Full-retraining and Fine-tuning.
long- and short-term interest
trade-off between cost and performance
Cons: Not best performance
Human-defined sampling strategies
Other methods: memory-based methods
R(Reservoir)
𝐷𝑡
Retrain
R
Randomly replace
elements in R with P
①
Motivation
Common limitation of existing methods:
lack an explicit optimization towards the retraining objective — i.e., the retrained model should serve
well for the recommendations of the next time period
Problems:one stagey only work well in one scenario, and bad in other scenarios
such as , one sample stagey can’t be suitable for all recommendation scenarios.
This motivates us to design a new method can add the retraining objective optimization process. Meanwhile,
we want to avoid to save previous to save long-term interest, in order to realize efficient retraining.
PROBLEM FORMULATION:
Original form:
Form with our goal: (constrained form)
Only new data can be used, and the goal is that perform well in the next period.
Previous model New model
Outline
Introduction
Our solution
Experiments
Conclusion
Our solution -- key idea
Previous model (parameters) has captured most
information of previous data.
So, save previous model instead of previous data.
During training, consider utilize future data in some
way!
Get a meta-model
Our solution -- framework
Overview
𝑊𝑡−1 𝑊𝑡
Transfer
𝑾𝒕−𝟏: Previous model, contains knowledge in the previous data 𝑾𝒕: A recommendation model to capture new information in the new collected data
Transfer: to combine the “knowledge” contained in 𝑊𝑡−1 𝑎𝑛𝑑 𝑊𝑡. Denote as 𝑓Θ.𝑾𝒕: new recommender model. 𝑊𝑡 = 𝑓Θ(𝑊𝑡−1, 𝑊𝑡)
We need:
A well-designed Transfer has the above ability.
A well-designed training process to make all the modules works.
Previous knowledge/information
long-term interests
New knowledge/information
short-term interests
To combine the two parts knowledge
𝑊𝑡All knowledge/information
Long- and short-term interests
Transfer
Two simple methods(1) pay different attentions to previous and current trained knowledge
𝑊𝑡 = 𝛼𝑊𝑡−1 + 1 − 𝛼 𝑊
Cons : limited representation ability; the relations between different dimensions
(2) MLP: 𝑊𝑡 = 𝑀𝐿𝑃(𝑊𝑡−1|| 𝑊𝑡)
Cons: not emphasize the interactions between the parameters of the same dimension
Our solution -- A CNN-based Neural Networks
Transfer
Stack layer CNN layers FC and output layer
𝑑 × 3 𝑑 × 𝑛1 𝑑 × 𝑛2𝑑𝑛2
𝑑𝑓
𝑑
Stack 𝑤𝑡−1, 𝑤𝑡 andw𝑡−1⊙ 𝑤𝑡
𝛼
to form a 2D picture
If w is not a vector, we
can shape it to a vector.
• 1d and horizontal convolution.
• Capture the relation between the parameters of
the same dimension ---- parameter evolution
[-1,1,0] 𝑤𝑡 −𝑤𝑡−1 𝑀𝐹 ∶ 𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡 𝑑𝑟𝑖𝑓
[0,0,1] interest vanish or appear
With the second layer, stronger ability
• Capture the relation
between the parameters
of the same dimension
• Recover the shape
Sequential Meta Learning
Directly train our model, only make fit to current period, and even only
make transfer focus on 𝑤𝑡.
-- to make transfer works, we proposed Sequential Meta Learning(SML)
Each period t, alternately update transfer and current model 𝒘𝒕. 𝑬𝒂𝒄𝒉 𝒓𝒐𝒖𝒏𝒅, 𝒕𝒉𝒆𝒓𝒆 𝒂𝒓𝒆 𝒕𝒘𝒐𝒎𝒂𝒊𝒏 𝒔𝒕𝒆𝒑
Step 1: learning 𝑤𝑡
Goal : learn the knowledge in
new collected data 𝐷𝑡
Fix the transfer, minimize:
Black blocks are fixed!
We do learn 𝑤𝑡 by recommendation loss of the
output of transfer instead of itself loss to make
𝑤𝑡 suitable to as the input of transfer.
Sequential Meta Learning
Step 2: learning the transfer
𝚯 ∶ 𝑠ℎ𝑎𝑟𝑒𝑑 𝑎𝑐𝑟𝑜𝑠𝑠 𝑎𝑙𝑙 𝑡𝑎𝑠𝑘𝑠
capture some task-invariant patterns
Goal: obtain such patterns that are tailored
for the next-period recommendations
So, fixed 𝑤𝑡, minimize :
i.e. recommendation loss on 𝐷𝑡+1
Black blocks are fixed!
Then, the above two steps are iterated until convergence or a maximum number of
iterations is reached.
Note that:
(1) we can run multiple passes of such sequential training on 𝐷𝑡 𝑡=0𝑇 when offline
training , to learn a better uniform transfer.
(2) When serving/testing , use the model gained before the first step 2.
Instantiation on MF
All users share a transfer, all item share another transfer.
Loss:
Outline
Introduction
Our solution
Experiments
Conclusion
Setting
Datasets:
𝐀𝐝𝐫𝐞𝐬𝐬𝐚 𝟏 : (1) News clicked data. time-sensitive, choose recent news , short-term interests.
(2) split each day into three periods:
morning (0:00-10:00), afternoon (10:00-17:00) evening (17:00-24:00)
𝐘𝐞𝐥𝐩[𝟐]: (1)users and businesses like restaurants. inherent (long-term) interest
(2)split it into 40 periods with an equal number of interactions
Data splits: offline-training/validation/testing periods:
Adressa: 48/5/10 Yelp: 30/3/7
Evaluation: (1) done on each interaction basis[3].
(2) sample 999 non-interacted items of a user as candidates
(3) Recall@K and NDCG@K (K=5,10,20)
Dataset Interaction
s
users item time span Total
periods
Adressa 3,664,225 478,612 20,875 three weeks 63
Yelp 3,014,421 59,082 122,816 > 10 years 40
[1]. Jon Atle Gulla et.al. 2017. The Adressa dataset for news recommendation. In WI
[2] https://www.yelp.com/dataset/
[3] Xiangnan He et.al. 2017. Neural collaborative filtering. In WWW
Performance
Average performance of testing periods
(1) Our method which only based on MF get best performance, even compared with SOTA
methods
(2) Our method can get good performance on all datasets. Full-retrain and Fine-tune can
only perform well one datasets respectively.
(3) sample-based retraining method SPMF performs better than Fine-tune on Yelp, but not
on Adressa. Drawback of heuristically designed method.
-- wonderful ability that automatically adapt to different scenarios.
-- historical data can be discarded during retraining, as long as the previous model
can be properly utilized
Performance
Each period -- recommendation and speed-up
Adress Yelp
(1) SML achieves the best performance in
most cases
(2) the fluctuations on Adressa are larger than
Yelp. Strong timeliness of the news domain
(1) SML is about 18 times faster than Full-retrain
(2) SML is stable
(3) SML-S (disabling the update of the transfer)
SML-S is even faster than Fine-tune
Our method is efficient
How do the components of SML affect its effectiveness?
Some variants:
SML-CNN: remove CNN SML-FC: remove FC layer
SML-N: disables the optimization of the transfer towards the next-period performance
SML-S: disabling the update of the transfer during testing
SML-FP: learns the 𝑊𝑡 directly based on itself recommendation loss on 𝐷𝑡
CNN and FC layer: both dimension-wise relations and cross-dimension relations between W𝑡 and 𝑊𝑡−1SML-N: worse than SML by 18.81% and 34.53% on average, optimizing towards future performance is
important
SML-S: drops by 7.87% and 9.43%. The mechanism for transfer may need be changed with times goes by.
SML-FP: fails to achieve a comparable performance as SML on both datasets.
Where does improvements come from?
Compared with Full-retrain on Yelp
User(item): new users(items): only occur in the testing data.
old user(items): otherwise
interactions: old user-new item (OU-NI), new user-new item (NU-NI),
old user-old item (OU-OI), and new user-old item (NU-OI)
(1) improvements of SML over Full-retrain are mainly from the recommendations for
new users and new items.
(2) strong ability of SML in quickly adapting to new data
(3) performance on the interaction type of old user-old item is nearly not degraded
Influence of hyper-parameters
Focus on hyper-parameters of CNN component
In some range of hyper-parameters, the performance is stable in some degree.
There are better hyper-parameters.
Conclusion & future works
main contributions:• formulate the sequential retraining process as an optimizable
problem
• new retraining approach:
• Recover knowledge of previous data by previous model instead of
data. It is efficient.
• Effective by optimizing for the future recommendation performance
Future works:
• Implement SML based on other models such as 𝑳𝒊𝒈𝒕𝒉𝑮𝑪𝑵[𝟏]
verify its generality
• Task/category-aware transfer designed.
different users/items may need different mechanism of transfer。
[1]. Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution
Network for Recommendation. In SIGIR
Thank You
Q&A
Transfer
model
⊙
𝑤𝑡
𝑤𝑡−1
.
.
𝐹01
𝐹𝑛11 𝐹1
2
𝐹𝑛22
𝑤𝑡
Top Related