Privacy-Preserving News Recommendation Model Learning · 2020. 11. 12. · 1425 our method can...

10
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1423–1432 November 16 - 20, 2020. c 2020 Association for Computational Linguistics 1423 Privacy-Preserving News Recommendation Model Learning Tao Qi 1 , Fangzhao Wu 2 , Chuhan Wu 1 , Yongfeng Huang 1 and Xing Xie 2 1 Department of Electronic Engineering & BNRist, Tsinghua University, Beijing 100084, China 2 Microsoft Research Asia, Beijing 100080, China {taoqi.qt,wuchuhan15}@gmail.com [email protected] {fangzwu,xing.xie}@microsoft.com Abstract News recommendation aims to display news articles to users based on their personal in- terest. Existing news recommendation meth- ods rely on centralized storage of user be- havior data for model training, which may lead to privacy concerns and risks due to the privacy-sensitive nature of user behaviors. In this paper, we propose a privacy-preserving method for news recommendation model train- ing based on federated learning, where the user behavior data is locally stored on user devices. Our method can leverage the useful informa- tion in the behaviors of massive number users to train accurate news recommendation mod- els and meanwhile remove the need of cen- tralized storage of them. More specifically, on each user device we keep a local copy of the news recommendation model, and com- pute gradients of the local model based on the user behaviors in this device. The local gradi- ents from a group of randomly selected users are uploaded to server, which are further aggre- gated to update the global model in the server. Since the model gradients may contain some implicit private information, we apply local differential privacy (LDP) to them before up- loading for better privacy protection. The up- dated global model is then distributed to each user device for local model update. We repeat this process for multiple rounds. Extensive ex- periments on a real-world dataset show the ef- fectiveness of our method in news recommen- dation model training with privacy protection. 1 Introduction With the development of Internet and mobile Inter- net, online news websites and Apps such as Yahoo! News 1 and Toutiao 2 have become very popular for people to obtain news information (Okura et al., 1 https://news.yahoo.com 2 https://www.toutiao.com/ 2017). Since massive news articles are posted on- line every day, users of online news services face heavy information overload (Zheng et al., 2018). Different users usually prefer different news infor- mation. Thus, personalized news recommendation, which aims to display news articles to users based on their personal interest, is a useful technique to improve user experience and has been widely used in many online news services (Wu et al., 2019b). The research of news recommendation has attracted many attentions from both academic and industrial fields (Okura et al., 2017; Wang et al., 2018; Lian et al., 2018; An et al., 2019; Wu et al., 2019a). Many news recommendation methods have been proposed in recent years (Wang et al., 2018; Wu et al., 2019b; Zhu et al., 2019b). These methods usually recommend news based on the matching between the news representation learned from news content and the user interest representation learned from historical user behaviors on news. For ex- ample, Okura et al. (2017) proposed to learn news representations from the content of news articles via autoencoder, and learn user interest represen- tations from the clicked news articles via Gated Recurrent Unit (GRU) network. They ranked the candidate news articles using the direct dot prod- uct of the news and user interest representations. These approaches all rely on the centralized storage of user behavior data such as news click histories for model training. However, users’ behaviors on news websites and Apps are privacy-sensitive, the leakage of which may bring catastrophic conse- quences. Unfortunately, the centralized storage of user behavior data in server may lead to high pri- vacy concerns from users and risks of large-scale private data leakage. In this paper, we propose a privacy-preserving method for news recommendation model training. Instead of storing user behavior data on a cen- tral server, in our method it is locally stored on

Transcript of Privacy-Preserving News Recommendation Model Learning · 2020. 11. 12. · 1425 our method can...

  • Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1423–1432November 16 - 20, 2020. c©2020 Association for Computational Linguistics

    1423

    Privacy-Preserving News Recommendation Model Learning

    Tao Qi1, Fangzhao Wu2, Chuhan Wu1, Yongfeng Huang1 and Xing Xie21Department of Electronic Engineering & BNRist, Tsinghua University, Beijing 100084, China

    2Microsoft Research Asia, Beijing 100080, China{taoqi.qt,wuchuhan15}@gmail.com [email protected]

    {fangzwu,xing.xie}@microsoft.com

    Abstract

    News recommendation aims to display newsarticles to users based on their personal in-terest. Existing news recommendation meth-ods rely on centralized storage of user be-havior data for model training, which maylead to privacy concerns and risks due to theprivacy-sensitive nature of user behaviors. Inthis paper, we propose a privacy-preservingmethod for news recommendation model train-ing based on federated learning, where the userbehavior data is locally stored on user devices.Our method can leverage the useful informa-tion in the behaviors of massive number usersto train accurate news recommendation mod-els and meanwhile remove the need of cen-tralized storage of them. More specifically,on each user device we keep a local copy ofthe news recommendation model, and com-pute gradients of the local model based on theuser behaviors in this device. The local gradi-ents from a group of randomly selected usersare uploaded to server, which are further aggre-gated to update the global model in the server.Since the model gradients may contain someimplicit private information, we apply localdifferential privacy (LDP) to them before up-loading for better privacy protection. The up-dated global model is then distributed to eachuser device for local model update. We repeatthis process for multiple rounds. Extensive ex-periments on a real-world dataset show the ef-fectiveness of our method in news recommen-dation model training with privacy protection.

    1 Introduction

    With the development of Internet and mobile Inter-net, online news websites and Apps such as Yahoo!News1 and Toutiao2 have become very popular forpeople to obtain news information (Okura et al.,

    1https://news.yahoo.com2https://www.toutiao.com/

    2017). Since massive news articles are posted on-line every day, users of online news services faceheavy information overload (Zheng et al., 2018).Different users usually prefer different news infor-mation. Thus, personalized news recommendation,which aims to display news articles to users basedon their personal interest, is a useful technique toimprove user experience and has been widely usedin many online news services (Wu et al., 2019b).The research of news recommendation has attractedmany attentions from both academic and industrialfields (Okura et al., 2017; Wang et al., 2018; Lianet al., 2018; An et al., 2019; Wu et al., 2019a).

    Many news recommendation methods have beenproposed in recent years (Wang et al., 2018; Wuet al., 2019b; Zhu et al., 2019b). These methodsusually recommend news based on the matchingbetween the news representation learned from newscontent and the user interest representation learnedfrom historical user behaviors on news. For ex-ample, Okura et al. (2017) proposed to learn newsrepresentations from the content of news articlesvia autoencoder, and learn user interest represen-tations from the clicked news articles via GatedRecurrent Unit (GRU) network. They ranked thecandidate news articles using the direct dot prod-uct of the news and user interest representations.These approaches all rely on the centralized storageof user behavior data such as news click historiesfor model training. However, users’ behaviors onnews websites and Apps are privacy-sensitive, theleakage of which may bring catastrophic conse-quences. Unfortunately, the centralized storage ofuser behavior data in server may lead to high pri-vacy concerns from users and risks of large-scaleprivate data leakage.

    In this paper, we propose a privacy-preservingmethod for news recommendation model training.Instead of storing user behavior data on a cen-tral server, in our method it is locally stored on

  • 1424

    (and never leaves) users’ personal devices, whichcan effectively reduce the privacy concerns andrisks (McMahan et al., 2017). Since the behaviordata of a single user is insufficient for model train-ing, we propose a federated learning based frame-work named FedNewsRec to coordinate massiveuser devices to collaboratively learn an accuratenews recommendation model without the need tocentralized storage of user behavior data. In ourframework, on each user device we keep a localcopy of the news recommendation model. Sincethe user behaviors on news websites or Apps storedin user device can provide important supervisioninformation of how the current model performs, wecompute the model gradients based on these localbehaviors. The local gradients from a group of ran-domly selected users are uploaded to server, whichare further aggregated to update the global newsrecommendation model maintained in the server.The updated global model is then distributed toeach user device for local model update. We repeatthis process for multiple rounds until the trainingconverges. Since the local model gradients mayalso contain some implicit private information ofusers’ behaviors on their devices, we apply the lo-cal differential privacy (LDP) technique to theselocal model gradients before uploading them toserver, which can better protect user privacy at thecost of slight performance sacrifice. We conductextensive experiments on a real-world dataset. Theresults show that our method can achieve satis-factory performance in news recommendation bycoordinating massive users for model training, andat the same time can well protect user privacy.

    The major contributions of this work include:(1) We propose a privacy-preserving method

    to train accurate news recommendation model byleveraging the behavior data of massive users andmeanwhile remove the need to its centralized stor-age to protect user privacy.

    (2) We propose to apply local differential privacyto protect the private information in local gradientscommunicated between user devices and server.

    (3) We conduct extensive experiments on a real-world dataset to verify the proposed method inrecommendation accuracy and privacy protection.

    2 Related Work

    2.1 News Recommendation

    News recommendation can be formulated as a prob-lem of matching between news articles and users.

    There are three core tasks for news recommenda-tion, i.e., how to model the content of news articles(news representation), how to model the personalinterest of users in news (user representation), andhow to measure the relevance between news con-tent and user interest. For news representation,many feature based methods have been applied. Forexample, Lian et al. (2018) represented news us-ing URLs, categories and entities. Recently, manydeep learning based news recommendation meth-ods represent news from the content using neuralnetworks. For example, Okura et al. (2017) used de-noising autoencoder to learn news representationsfrom news content. Wu et al. (2019c) proposedto learn news representations from news titles viamulti-head self-attention network. For user repre-sentation, existing news recommendation methodsusually model user interest from their historicalbehaviors on news platforms. For example, Okuraet al. (2017) learned user representations from thepreviously clicked news using GRU network. Anet al. (2019) proposed a long- and short-term userrepresentation model (LSTUR) for user interestmodeling. It captures the long-term user interestvia user ID embedding and the short-term user in-terest from the latest news click behaviors via GRU.For measuring the relevance between user interestand news content, dot product of user and newsrepresentation vectors is widely used (Okura et al.,2017; Wu et al., 2019a; An et al., 2019). Somemethods also explore cosine similarity (Zhu et al.,2019b), feed-forward network (Wang et al., 2018),feature-interaction network (Lian et al., 2018).

    These existing news recommendation methodsall rely on centrally-stored user behavior data formodel training. However, users’ behaviors on newsplatforms are privacy-sensitive. The centralizedstorage of user behavior data may lead to seriousprivacy concerns of users. In addition, the newsplatforms have high responsibility to prevent userdata leakage, and have high pressure to meet therequirements of strict user privacy protection regu-lations like GDPR3. Different from existing newsrecommendation methods, in our method the userbehavior data is locally stored on personal devices,and only the model gradients are communicatedbetween user devices and server. Since the modelgradients contain much less user information thanthe raw behavior data and they are futher processedby the Local Differential Privacy (LDP) technique,

    3https://eugdpr.org/

  • 1425

    our method can protect user privacy much betterthan existing news recommendation methods.

    2.2 Federated Learning

    Federated learning (McMahan et al., 2017) isa privacy-preserving machine learning techniquewhich can leverage the rich data of massive usersto train shared intelligent models without the needto centrally store the user data. In federated learn-ing the user data is locally stored on user mobiledevices and never uploaded to server. Instead, eachuser device computes a model update based onthe local data, and the locally-computed model up-dates from many users are aggregated to update theshared model. Since model updates usually containmuch less information than the raw user data, therisks of privacy leakage can be effectively reduced.Federated learning requires that the labeled datacan be inferred from user interactions for super-vised model learning, which can be perfectly satis-fied in our news recommendation scenario, sincethe click and skip behaviors on news websites andApps can provide rich supervision information.

    Federated learning has been applied to trainingquery suggestion model for smartphone keyboardand topic models (Jiang et al., 2019). There are alsosome explorations in applying federated learning torecommendation (Ammad et al., 2019; Chai et al.,2019). For example, Ammad et al. (2019) proposeda federated collaborative filtering (FCF) method.In FCF, the personal rating data is locally storedon user client and is used to compute the local gra-dients of user embeddings and item embeddings.The user embeddings are locally maintained in userclient and are directly updated using the local gra-dient on each client. The item embeddings aremaintained by a central server, and are updated us-ing the aggregated gradients of many clients. Chaiet al. (2019) proposed a federated matrix factor-ization method, which is very similar with FCF.However, these methods require all users to par-ticipate the process of federated learning to traintheir embeddings, which is not practical in real-world recommendation scenarios. Besides, thesemethods represent items using their IDs, and aredifficult to handle new items since many news arti-cles are posted every day which are all new items.Thus, these federated learning based recommenda-tion methods have their inherent drawbacks, andare not suitable for news recommendation.

    2.3 Local Differential PrivacyLocal differential privacy (LDP) is an importanttechnique to provide guarantees of privacy for sen-sitive information collection and analysis (Renet al., 2018). It has attracted increasing attentionssince user privacy protection has become a moreand more important issue (Kairouz et al., 2014; Qinet al., 2016). A classical scenario of LDP is thatthere are a set of users, and each user u has a pri-vate value v, which is sent to a untrusted third-partyaggregator so that the aggregator can learn somestatistical information of the private value distribu-tion among the users (Cormode et al., 2018). LDPcan guarantee that the leakage of private informa-tion for each individual user is bounded by apply-ing a randomized algorithmM to private value vand sending the perturbed valueM(v) to the ag-gregator for statistical information inference. Therandomized algorithmM(·) is called to satisfy �-local differential privacy if and only if for two ar-bitrary input private values v and v′, the followinginequation holds:

    Pr[M(v) = y] ≤ e� Pr[M(v′) = y], (1)

    where y ∈ range(M). � ≥ 0, and it is usu-ally called privacy budget. Smaller � meansbetter private information protection. In manyworks (Sarathy and Muralidhar, 2010; Duchi et al.,2013), M(·) is implemented by adding Laplacenoise to the private value. In this paper we applyLDP technique to the model gradients which aregenerated in user devices based on user behaviorsand uploaded to server, to better protect user pri-vacy and remove the need to a trusted server.

    3 FedNewsRec for Privacy-PreservingNews Recommendation

    In this section we introduce our FedNewsRecmethod for privacy-preserving news recommen-dation model training. We first describe the newsrecommendation model. Then we describe the de-tails of FedNewsRec.

    3.1 Basic News Recommendation ModelFollowing previous works (Wu et al., 2019a; Anet al., 2019), the news recommendation model inour method can be decomposed into two core sub-models, i.e., a news model to learn news representa-tions and a user model to learn user representations.

    News Model The news model aims to learnnews representations to model news content. Its ar-

  • 1426

    NewsEmbedding

    ···

    News TitleTrump claims Ukraine

    ···

    !

    ···

    Word Embedding

    ···

    Attention

    Self-Attention

    ······CNN

    Figure 1: The architecture of news model.

    chitecture is shown in Fig. 1. Following (Wu et al.,2019b), we learn news representations from newstitles. The news model contains four layers stackedfrom bottom to up. The first layer is word embed-ding, which converts the word sequence in a newstitle into a sequence of semantic word embeddingvectors. The second layer is a CNN network, whichis used to learn word representations by capturinglocal contexts. The third layer is a multi-head self-attention network (Vaswani et al., 2017), which canlearn contextual word representations by modelingthe long-range relatedness between different words.The fourth layer is an attention network, which isused to build a news representation vector t fromthe output of multi-head self-attention network byselecting informative words.

    User Model The user model is used to learnuser representations to model their personal inter-est. Its architecture is shown in Fig. 2. Follow-ing (Okura et al., 2017), we learn user representa-tion from their clicked news articles. Motivated bythe LSTUR model proposed by An et al. (2019),we learn representations of users by capturing bothlong-term and short-term interests. The differenceis that in LSTUR the embeddings of user IDs areused to model long-term interest, while in our usermodel it is learned from all the historical behav-iors through a combination of a multi-head self-attention network and an attentive pooling network.This is because in the federated learning scenario, itis not practical that all users can participate the pro-cess of model training. Thus, the ID embeddings ofmany users in LSUTR cannot be learned. For short-term user interest modeling, our user model appliesa GRU network to the recent behaviors of users,which is the same with LSUTR. The embeddings

    User Embedding

    ···

    ··· ···

    Historical clicked news

    Short-term user embedding

    !N-M+1

    "

    "#"$ Long-term user embedding

    Attention

    !N-M+2 !N!1

    GRU GRU GRUAttention

    Self-Attention

    ······

    Figure 2: The architecture of user model.

    of long-term interest and short-term interest arecombined with an attention network into a unifieduser embedding vector u.

    Model Training from User Behavior Users’behaviors on news websites and Apps can provideuseful supervision information to train the newsrecommendation models. For example, if a useru clicks a news article t which has low rankingscore predicted by the model, then we can tune themodel to give higher ranking score for this user-news pair. We propose to train the news recommen-dation model based on both click and non-clickbehaviors. More specifically, following (Wu et al.,2019b), for each news tci clicked by user u, werandomly sample H news which are displayed inthe same impression but not clicked. Assume thisuser has Bu click behaviors in total, then the lossfunction of the news recommendation model withparameter set Θ is defined as:

    Lu(Θ) =Bu∑i=1

    Li, (2)

    Li = − log( exp(s(u, tci ))

    exp(s(u, tci )) +∑H

    j=1 exp(s(u, tnci,j))

    ),

    (3)where tci and t

    nci,j are clicked and non-clicked news

    articles shown in the same impression. s(u, t) is theranking score of news t for user u, which is definedas the dot product of their embedding vectors, i.e.,s(u, t) = uT t.

    3.2 The Framework of FedNewsRecNext, we introduce our FedNewsRec framework forprivacy-preserving news recommendation model

  • 1427

    !

    Aggregator

    Update

    Upload

    Global Model

    Model Gradients

    User iLossℒ#

    ServerModel

    Gradients User j

    Lossℒ$

    Distribute Updated Model

    Distribute Updated Model

    User Logs ℬ#Randomized

    Gradients

    LDP Module

    LDP Module

    Randomized Gradients

    User Logs ℬ$

    Local ModelNewsModel

    User Model

    NewsModel

    User Model

    Clipping

    Clipping

    Local ModelNewsModel

    User Model

    Figure 3: The framework of our privacy-preserving news recommendation approach.

    training, which is shown in Fig. 3. In our Fed-NewsRec framework, the user behaviors on newsplatforms (websites or Apps) are locally stored onthe user devices and never uploaded to server. Inaddition, the servers which provide news servicesdo not record nor collect the user behaviors, whichcan reduce the privacy concerns of users and therisks of data leakage. Since an accurate news rec-ommendation model can effectively improve users’news reading experiences and the behavior datafrom a single user is far from sufficient for trainingan accurate and unbiased model, in our FedNews-Rec framework we propose to coordinate a largenumber of user devices to collectively train intelli-gent news recommendation models.

    Following (McMahan et al., 2017), each user de-vice which participates the model training is calleda client. Each client has a copy of the current newsrecommendation model Θ which is maintained bythe server. Assume user u’s client has accumulateda set of behaviors on news platforms which is de-noted as Bu, then we compute a local gradient ofmodel Θ according to the behaviors Bu and theloss function defined in Eq. (3), which is denotedas gu = ∂Lu∂Θ . Although the local model gradientgu is computed from a set of behaviors rather thana single behavior, it may still contain some privateinformation of user behaviors (Zhu et al., 2019a).Thus, for better privacy protection, we apply localdifferential privacy (LDP) technique to the localmodel gradients. Denote the randomized algorithmapplied to gu asM, which is defined as:

    M(gu) = clip(gu, δ) + n, (4)

    n ∼ La(0, λ), (5)

    where n is Laplace noise with 0 mean. The param-

    eter λ can control the strength of Laplace noise,and larger λ can bring better privacy protection.The function clip(x, y) is used to limit the valueof x with the scale of y. It is motivated by somestudies which show that applying gradient clippingcan help avoid potential gradient explosion and isbeneficial for model training (Zhang et al., 2019).Denote the randomized gradient as g̃u = M(gu).After clip and randomization operation, it is moredifficult to infer the raw user behaviors from thegradients. The user client uploads the randomizedlocal model gradient g̃u to the server.

    In our FedNewsRec framework, a server is usedto maintain the news recommendation model andupdate it via the model gradients from a large num-ber of users. In each round, the server randomlyselects a random fraction r (e.g., 10%) of the userclients, and sends the current news recommenda-tion model Θ to them. Then it collects and aggre-gates the local model gradients from the selecteduser clients as follows:

    g =1∑

    u∈U |Bu|∑u∈U|Bu| · g̃u, (6)

    where U is the set of users selected for the learningprocess in this round, and Bu is the set of behaviorsof user u for local model gradient computation.

    Then the aggregated gradient g is used to updatethe global news recommendation model Θ main-tained in the server:

    Θ = Θ− η · g, (7)

    where η is the learning rate. The updated globalmodel is then distributed to user devices to updatetheir local models. This process is repeated untilthe model training converges.

  • 1428

    3.3 Discussions on Privacy ProtectionNext, we discuss why our FedNewsRec frameworkcan protect user privacy in news recommendationmodel training. First, in our method the privateuser behavior data is stored on user own devices,and is never uploaded to server. Only the modelgradients inferred from the local user behaviors arecommunicated with the server. According to thedata processing inequality (McMahan et al., 2017),these gradients never contain more private informa-tion than the raw user behaviors, and usually con-tain much less information (McMahan et al., 2017).Thus, the user privacy can be better protected com-pared with the centralized storage of user behaviordata as did in existing news recommendation meth-ods. Second, the local model gradients are com-puted from a group of user behaviors instead of asingle behavior. Thus, it is not very easy to infera specific behavior from the local model gradientsuploaded to server. Third, we apply the local dif-ferential privacy technique to the local model gra-dients before uploading by adding Laplace noiseto them. It can strengthen the privacy protectionof the private information in local model gradients.According to (Choi et al., 2018), Laplace noise inLDP can achieve �-local differential privacy, and� =

    maxv,v′ |M(v)−M(v′)|λ , where v and v

    ′ are ar-bitrary values in local model gradient. Since theupper bound of maxv,v′ |M(v) −M(v′)| in ourFedNewsRec framework is 2δ, the upper bound ofthe privacy budget � is 2δλ . We can see that by in-creasing λ (i.e., the strength of the noise), we canachieve a smaller privacy budget �which means bet-ter privacy protection. However, strong noise willhurt the accuracy of aggregated gradients. Thus, λshould be selected based on the trade-off betweenprivacy protection and model performance.

    4 Experiment

    4.1 Dataset and Experimental SettingsOur experiments were conducted on a public newsrecommendation dataset (named Adressa) collectedfrom a Norwegian news website (Gulla et al.,2017) and another real-world dataset collectedfrom Microsoft News4 (named MSN-News).5 Forthe Adressa dataset, following Hu et al. (2020), weused user logs in the first five days to construct

    4https://www.msn.com/en-us5Our dataset and codes will be publicly available in

    https://github.com/JulySinceAndrew/FedNewsRec-EMNLP-Findings-2020.

    MSN-News Adressa# users 100,000 528,514# news 118,325 16,004

    # impressions 1,341,853 -# positive behaviors 2,006,289 2,411,187# negative behaviors 48,051,601 -

    avg. # title length 11.52 6.60

    Table 1: The statistical information of the dataset.

    users’ click history, used logs in the 6-th day formodel training, and used logs in the 7-th day formodel evaluation. Since the Adressa dataset doesnot contain non-clicked data, we randomly sampled20 news as negative samples for each click to con-struct the test set. For the MSN-News dataset, werandomly sampled 100,000 users and their behav-ior logs in 5 weeks (from October 19 to November22, 2019). We assume that the behavior logs ofdifferent users are stored in a decentralized way tosimulate the real application of privacy-preservingnews recommendation model training. We used thebehaviors in the last week for test and the remain-ing behaviors for training. In addition, since inpractical applications not all users can participatethe model training, we randomly selected half ofthe users for training and tested the model on allusers. The detailed statistics of the two datasets arelisted in Table 1. Following (Wu et al., 2019b), weused the average scores of AUC, MRR, nDCG@5,nDCG@10 of all impressions in the test set to eval-uate the performance. We repeated each experi-ment five times and reported average results andstand errors.

    In experiments we used the 300-dimensional pre-trained Glove embedding to initialize word embed-dings. The number of the self-attention head is 20and the output dimension of each head is 20. Thedimension of GRU hidden state is 400. H in Eq. (3)is 4. The fraction r of users participating in modeltraining in each round is 2%. The learning rate η inEq. (7) is 0.5. δ in Eq. (4) is 0.005 and λ in Eq. (5)is 0.015. These hyper-parameters are all selectedaccording to cross-validation on the training set.

    4.2 Effectiveness Evaluation

    First, we verify the effectiveness of the proposedFedNewsRec method. We compared with manymethods, including: (1) FM (Rendle, 2012), factor-ization machine, a classic method for recommen-dation; (2) DFM (Lian et al., 2018), deep fusionmodel for news recommendation; (3) EBNR (Okura

  • 1429

    MethodMSN-News Adressa

    AUC MRR nDCG@5 nDCG@10 AUC MRR nDCG@5 nDCG@10FM 58.41±0.04 27.19±0.05 28.98±0.04 34.57±0.06 61.94±0.80 26.59±0.33 22.69±0.54 32.17±0.46

    DFM 61.25±0.26 28.68±0.10 30.62±0.21 36.38±0.23 65.14±0.69 34.74±0.89 33.17±1.46 39.79±1.08EBNR 63.64±0.15 29.50±0.14 31.57±0.13 37.38±0.20 65.70±0.72 30.23±0.49 29.37±0.53 36.38±0.44DKN 62.38±0.19 29.40±0.15 31.59±0.11 37.27±0.21 67.53±1.90 32.33±2.79 31.84±2.78 39.96±2.52DAN 62.54±0.23 29.44±0.18 31.67±0.14 37.31±0.25 64.03±3.10 33.37±2.63 31.61±3.03 38.60±3.02

    NAML 64.52±0.24 30.93±0.17 33.39±0.16 39.07±0.19 69.20±2.07 35.18±1.49 34.78±1.85 42.34±1.97NPA 64.29±0.20 30.63±0.15 33.11±0.17 38.89±0.23 66.70±2.42 34.68±1.77 33.72±2.09 41.18±1.99

    NRMS 65.72±0.16 31.85±0.20 34.59±0.18 40.25±0.17 67.97±2.23 33.16±2.54 32.37±3.59 40.41±2.82FCF 51.03±0.27 22.24±0.14 22.97±0.21 28.44±0.23 53.33±1.28 23.04±2.68 20.24±2.77 27.09±2.61

    FedNewsRec 64.65±0.15 30.60±0.09 33.03±0.11 38.77±0.10 69.91±2.53 35.55±1.85 33.74±2.45 41.47±2.78CenNewsRec 66.45±0.17 31.91±0.22 34.62±0.18 40.33± 0.24 71.02±2.09 36.31±2.52 35.73±3.71 43.98±2.52

    Table 2: The news recommendation results of different methods.

    et al., 2017), using GRU for user modeling (Choet al., 2014); (4) DKN (Wang et al., 2018), usingknowledge-aware CNN network for news repre-sentation in news recommendation; (5) DAN (Zhuet al., 2019b), using CNN to learn news represen-tations from both news title and entities and usingLSTM to learn user representations; (6) NAML (Wuet al., 2019a), learning news representations via at-tentive multi-view learning; (7) NPA (Wu et al.,2019b), using personalized attention network tolearn news and user representations; (8) NRMS (Wuet al., 2019d), learning representations of newsand users via multi-head self-attention network;(9) FCF (Ammad et al., 2019), a federated collab-orative filtering method for recommendation; (10)CenNewsRec, which has the same news recommen-dation model with FedNewsRec but is trained oncentralized user behavior data.

    The results are listed in Table 2. First, by com-paring FedNewsRec with SOTA news recommenda-tion methods such as NRMS, NPA and EBNR, ourmethod can achieve comparable and even betterperformance on news recommendation. It validatesthe effectiveness of our approach in learning ac-curate models for personalized news recommenda-tion. Moreover, different from these existing newsrecommendation methods which are all trained oncentralized storage of user behavior data, in ourFedNewsRec the user behavior data is stored onlocal user devices and is never uploaded. Thus, ourmethod can train accurate news recommendationmodel and meanwhile better protect user privacy.

    Second, our method can perform better than ex-isting federated learning based recommendationmethods like FCF (Ammad et al., 2019). The per-formance of FCF is not good in news recommen-dation. This is because FCF requires each user andeach item to participate the training process to learn

    their embeddings. However, in practical applica-tion not all the users can participate the trainingdue to different reasons. In addition, news arti-cles on online news platforms expire very quickly,and new news articles continuously emerge. Thus,many items for recommendation are news items,and unseen in the training data, which cannot behandled by FCF. In our method we learn newsrepresentations from news content and learn userrepresentations from their behaviors using neuralmodels. Therefore, our method can handle theproblem of new users and new items, and is moresuitable for news recommendation scenario.

    Third, FedNewsRec performs worse than Cen-NewsRec which has the same news recommenda-tion model with FedNewsRec but is trained on thecentralized user behavior data. This is intuitivesince centralized data is more beneficial for modeltraining than decentralized data. In addition, inFedNewsRec we apply local differential privacytechnique with Laplace noise to protect the privateinformation in model gradients, which leads to theaggregated gradient for model update less accu-rate. Luckily, the gap between the performanceof FedNewsRec and CenNewsRec is not very big.Thus, our FedNewsRec method can achieve muchbetter privacy protection at the cost of acceptableperformance decline. These results validate theeffectiveness of our method.

    4.3 Influence of User Number

    In this section, we explore whether our FedNews-Rec method can exploit the useful behavior infor-mation of massive users in a federated way to trainaccurate news recommendation models. In thefollowing sections, we only show the experimentalresults on the MSN-News dataset. We randomly se-lect different numbers of users for model training,

  • 1430

    1k 5k 10k 20k 30k 40k 50k 60k 80k 100k# Users

    52

    56

    60

    64A

    UC

    AUC

    nDCG@10

    32

    36

    40

    44

    nD

    CG

    @1

    0

    Figure 4: Performance with different numbers of users.

    0.000 0.005 0.010 0.015 0.020 0.025 0.03061

    62

    63

    64

    65

    AUC

    (a) Model performance.

    0.005 0.010 0.015 0.020 0.025 0.0300.0

    0.5

    1.0

    1.5

    2.0

    (b) Privacy budget �.

    Figure 5: Influence of the hyper-parameters λ and δ.

    and use all users for evaluation. The experimentalresults are shown in Fig. 4.

    From Fig. 4 we have several observations. First,when the number of users is small (e.g., 1000),the performance of news recommendation modeltrained on the behavior data of these users is not sat-isfactory. This is because the behaviors of a singleuser are usually very limited, and behavior data ofa small number of users is insufficient to train accu-rate news recommendation model. This result val-idates the motivation of our FedNewsRec methodto coordinate a large number of users in a feder-ated way for model training. Second, when thenumber of users participating in training increases,the performance of FedNewsRec improves. It indi-cates that FedNewsRec can effectively exploit theuseful behavior information from different users tocollectively train an accurate news recommenda-tion model, which validates the effectiveness of ourframework. Third, when the number of users is bigenough, further incorporating more users can onlybring marginal performance improvement. Thisresult shows that a reasonable number of users aresufficient for news recommendation model train-ing, and it is unnecessary to involve too many orall users which is costly and impractical.

    4.4 Hyper-parameter Analysis

    In this section, we explore the influence of hyper-parameters on our method. We show the results oftwo important hyper-parameters, i.e., δ in Eq. (4)

    Figure 6: Convergence of model training.

    and λ in Eq. (5) which serve in the local differen-tial privacy module of our FedNewsRec framework.The results are shown in Fig. 5. In Fig. 5(a) weshow the performance of our method with differentλ and δ values. We find that a large λ value canlead to the performance decline. This is becauselarger λ means stronger Laplace noise added to thegradients in LDP module, making the aggregatedgradient for model update less accurate. In addi-tion, our method tends to have better performancewhen δ is larger. This is because fewer gradientswill be affected in the clip operation when δ islarger. In Fig. 5(b) we show the upper bound of theprivacy budget, i.e., � in Section 3.3, with differentλ and δ values. We can find that with larger λ valueand smaller δ value, the privacy budget � becomeslower, which means better privacy protection. Thisis intuitive, since larger λ value and smaller δ valueindicate that stronger noise is added and more gra-dient values are clipped, making it more difficult torecover the original model gradients. CombiningFig. 5(a) and Fig. 5(b) we can see that the better pri-vacy protection is achieved by some sacrifice of theperformance, and we need to select λ and δ valuesbased on the trade-off between privacy protectionand news recommendation performance.

    4.5 Convergence Analysis

    Next we explore the convergence of the model train-ing in FedNewsRec, and the results are shown inFig. 6. We can see that the training process can con-verge in about 1,500 rounds under different settingsof r (i.e., ratio of selected users for model trainingin each round). It indicates that FedNewsRec cantrain news recommendation model efficiently.

    4.6 Effectiveness of User Model

    In this section, we conduct ablation studies to eval-uate the effectiveness of the short- and long- termuser interest modeling in our user model. The ex-

  • 1431

    AUC [email protected]

    65.3

    65.6

    65.9

    66.2

    66.5

    AU

    C

    39.0

    39.3

    39.6

    39.9

    40.2

    40.5

    nD

    CG

    @1

    0

    - long-term user embedding

    - short-term user embedding

    LSTUR

    Figure 7: Effectiveness of User Model

    perimental results are shown in Fig. 7, from whichwe have several observations. First, after removingthe short-term user embedding, the performance ofour method declines. This is because users some-times tend to read news related to the topics theyrecently cared about (An et al., 2019). Our usermodel learns the short-term user embedding fromthe sequence of users’ recent clicked news via aGRU network, which can effectively capture users’short-term interest. Thus, removing the short-termuser embedding makes the unified user embeddingloss some information of users’ recent reading pref-erence and causes performance decline. Second,after removing the long-term user embedding, theperformance of our method also declines. Thisis because users may read some news accordingto their long-term interests, which may not be re-flected by their recent reading history (An et al.,2019). To address this issue, our user model learnslong-term user embedding by capturing the relat-edness among users’ clicked news, which can ef-fectively capture users’ long-term interest. Afterremoving it, the unified user embedding losses theinformation of the long-term interest, which hurtsthe recommendation accuracy.

    5 Conclusion

    In this paper, we propose a privacy-preservingmethod for news recommendation model training.Different from existing methods which rely oncentralized storage of user behavior data, in ourmethod the user behaviors are locally stored on userdevices. We propose a FedNewsRec framework tocoordinate a large number of users to collectivelytrain accurate news recommendation models fromthe behavior data of these users without the need toupload it. In our method each user client computeslocal model gradients based on the user behaviors

    on device, and sends them to server. The servercoordinates the training process and maintains aglobal news recommendation model. It aggregatesthe local model gradients from massive users andupdates the global model using the aggregated gra-dient. Then the server sends the updated model touser clients and this process is repeated for multi-ple rounds. In order to further protect the privateinformation in the local model gradients, we applylocal differential privacy to them by adding Laplacenoise. The experiments on real-world dataset showthat our method can achieve comparable perfor-mance with SOTA news recommendation methods,and meanwhile can better protect user privacy.

    Acknowledgments

    This work was supported by the National Key Re-search and Development Program of China underGrant number 2018YFC1604002, and the NationalNatural Science Foundation of China under Grantnumbers U1836204, U1705261, and 61862002.

    ReferencesMuhammad Ammad, Elena Ivannikova, Suleiman A

    Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan,and Adrian Flanagan. 2019. Federated collab-orative filtering for privacy-preserving personal-ized recommendation system. arXiv preprintarXiv:1901.09888.

    Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang,Zheng Liu, and Xing Xie. 2019. Neural news recom-mendation with long-and short-term user representa-tions. In ACL, pages 336–345.

    Di Chai, Leye Wang, Kai Chen, and Qiang Yang.2019. Secure federated matrix factorization. arXivpreprint arXiv:1906.05108.

    Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bah-danau, and Yoshua Bengio. 2014. On the propertiesof neural machine translation: Encoder–decoder ap-proaches. In Proceedings of Eighth Workshop onSyntax, Semantics and Structure in Statistical Trans-lation, pages 103–111.

    Woo-Seok Choi, Matthew Tomei, Jose Ro-drigo Sanchez Vicarte, Pavan Kumar Hanumolu,and Rakesh Kumar. 2018. Guaranteeing localdifferential privacy on ultra-low-power systems. InISCA, pages 561–574.

    Graham Cormode, Somesh Jha, Tejas Kulkarni,Ninghui Li, Divesh Srivastava, and Tianhao Wang.2018. Privacy at scale: Local differential privacy inpractice. In SIGMOD, pages 1655–1658.

  • 1432

    John C Duchi, Michael I Jordan, and Martin J Wain-wright. 2013. Local privacy and statistical minimaxrates. In FOCS, pages 429–438.

    Jon Atle Gulla, Lemei Zhang, Peng Liu, ÖzlemÖzgöbek, and Xiaomeng Su. 2017. The adressadataset for news recommendation. In Proceedingsof the international conference on web intelligence,pages 1042–1048.

    Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, ChuanShi, Nan Duan, Xing Xie, and Ming Zhou. 2020.Graph neural news recommendation with unsuper-vised preference disentanglement. In ACL, pages4255–4264.

    Di Jiang, Yuanfeng Song, Yongxin Tong, Xueyang Wu,Weiwei Zhao, Qian Xu, and Qiang Yang. 2019. Fed-erated topic modeling. In CIKM, page 1071–1080.

    Peter Kairouz, Sewoong Oh, and Pramod Viswanath.2014. Extremal mechanisms for local differentialprivacy. In NIPS, pages 2879–2887.

    Jianxun Lian, Fuzheng Zhang, Xing Xie, andGuangzhong Sun. 2018. Towards better represen-tation learning for personalized news recommenda-tion: a multi-channel deep fusion approach. In IJ-CAI, pages 3805–3811.

    Brendan McMahan, Eider Moore, Daniel Ramage,Seth Hampson, and Blaise Aguera y Arcas. 2017.Communication-efficient learning of deep networksfrom decentralized data. In AISTATS, pages 1273–1282.

    Shumpei Okura, Yukihiro Tagami, Shingo Ono, andAkira Tajima. 2017. Embedding-based news rec-ommendation for millions of users. In KDD, pages1933–1942.

    Zhan Qin, Yin Yang, Ting Yu, Issa Khalil, XiaokuiXiao, and Kui Ren. 2016. Heavy hitter estimationover set-valued data with local differential privacy.In CCS, pages 192–203. ACM.

    Xuebin Ren, Chia-Mu Yu, Weiren Yu, Shusen Yang,Xinyu Yang, Julie A McCann, and S Yu Philip.2018. High-dimensional crowdsourced data publi-cation with local differential privacy. TIFS, pages2151–2166.

    Steffen Rendle. 2012. Factorization machines withlibfm. TIST, page 57.

    Rathindra Sarathy and Krish Muralidhar. 2010. Someadditional insights on applying differential privacyfor numeric data. In International Conference onPrivacy in Statistical Databases, pages 210–219.

    Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In NIPS, pages 6000–6010.

    Hongwei Wang, Fuzheng Zhang, Xing Xie, and MinyiGuo. 2018. Dkn: Deep knowledge-aware networkfor news recommendation. In WWW, pages 1835–1844.

    Chuhan Wu, Fangzhao Wu, Mingxiao An, JianqiangHuang, Yongfeng Huang, and Xing Xie. 2019a.Neural news recommendation with attentive multi-view learning. IJCAI, pages 3863–3869.

    Chuhan Wu, Fangzhao Wu, Mingxiao An, JianqiangHuang, Yongfeng Huang, and Xing Xie. 2019b.Npa: Neural news recommendation with personal-ized attention. In KDD, pages 2576–2584.

    Chuhan Wu, Fangzhao Wu, Mingxiao An, Tao Qi,Jianqiang Huang, Yongfeng Huang, and Xing Xie.2019c. Neural news recommendation with heteroge-neous user behavior. In EMNLP, pages 4876–4885.

    Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi,Yongfeng Huang, and Xing Xie. 2019d. Neu-ral news recommendation with multi-head self-attention. In EMNLP, pages 6390–6395.

    Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jad-babaie. 2019. Why gradient clipping acceleratestraining: A theoretical justification for adaptivity. InICLR.

    Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, YangXiang, Nicholas Jing Yuan, Xing Xie, and Zhen-hui Li. 2018. Drn: A deep reinforcement learn-ing framework for news recommendation. In WWW,pages 167–176.

    Ligeng Zhu, Zhijian Liu, and Song Han. 2019a.Deep leakage from gradients. arXiv preprintarXiv:1906.08935.

    Qiannan Zhu, Xiaofei Zhou, Zeliang Song, JianlongTan, and Guo Li. 2019b. Dan: Deep attention neuralnetwork for news recommendation. In AAAI, pages5973–5980.