
A Review of Privacy Preserving Federated Learning for Private IoT Analytics

Christopher Briggs
School of Computing & Mathematics
Keele University
Keele, UK
[email protected]

Zhong Fan
School of Computing & Mathematics
Keele University
Keele, UK
[email protected]

Peter Andras
School of Computing & Mathematics
Keele University
Keele, UK
[email protected]

Abstract—The Internet-of-Things generates vast quantities of data, much of it attributable to an individual's activity and behaviour. Holding and processing such personal data in a central location presents a significant privacy risk to individuals (of being identified or of their sensitive data being leaked). However, analytics based on machine learning and in particular deep learning benefit greatly from large amounts of data to develop high-performance predictive models. Traditionally, data and models are stored and processed in a data centre environment where models are trained in a single location. This work reviews research around an alternative approach to machine learning known as federated learning, which seeks to train machine learning models in a distributed fashion on devices in the user's domain, rather than by a centralised entity. Furthermore, we review additional privacy preserving methods applied to federated learning used to protect individuals from being identified during training and once a model is trained. Throughout this review, we identify the strengths and weaknesses of different methods applied to federated learning and finally, we outline future directions for privacy preserving federated learning research, particularly focusing on Internet-of-Things applications.

Index Terms—federated learning, distributed machine learning, deep learning, data privacy, internet-of-things

I. INTRODUCTION

The Internet-of-Things (IoT) is represented by network-connected machines, often small embedded computers that provide physical objects with digital capabilities including identification, inventory tracking and sensing & actuator control. Mobile devices such as smartphones also represent a facet of the IoT, often used as sensing devices as well as to control and monitor other IoT devices. The applications that drive analytical insights in the IoT are often powered by machine learning and deep learning.

Gartner [1] predicts that 20 billion IoT devices will be installed by 2020, forecasting a bright future for IoT applications. However, this poses a challenge for traditional cloud-based IoT computing. The volume, velocity and variety of data streaming from billions of these devices requires vast amounts of bandwidth, which can become extremely cost prohibitive. Additionally, many IoT applications require very low-latency or near real-time analytics and decision making capabilities. The round-trip delay from devices to the cloud and back again is unacceptable for such applications. Finally, transmitting

sensitive data collected by IoT devices to the cloud poses security and privacy concerns. Edge computing and, more recently, fog computing [2] have been proposed as solutions to these problems.

Edge computing (and its variants: mobile edge computing, multi-access edge computing) restricts analytics processing to the edge of the network, on devices attached to, or very close to, the perception layer [3]. However, storage and compute power may be severely limited and coordination between multiple devices may be non-existent in the edge computing paradigm. Fog computing [2] offers an alternative to cloud computing or edge computing alone for many analytics tasks but significantly increases the complexity of an IoT network. Fog computing is generally described as a continuum of compute, storage and networking capabilities to power applications and services in one or more tiers that bridge the gap between the cloud and the edge [4], [5]. Fog computing enables highly scalable, low-latency, geo-distributed applications, supporting location awareness and mobility [6]. Despite rising interest in fog-based computing, much research is still focused on deployment of analytics applications (including deep learning applications) directly to edge devices.

Performing computationally expensive tasks such as training deep learning models on edge devices poses a challenge due to limited energy budgets and compute capabilities [7]. In cloud environments, massively powerful and scalable servers making use of parallelisation are typically employed for deep learning tasks [8]. In fog and edge computing environments, alternative methods for distributing training are required. Additionally, as limited bandwidth is a key constraint in computing near/at the edge, the challenge of reducing network data transfer is very important.

This paper provides a comprehensive review of privacy preserving federated learning [9]. We show how federated learning is ideally suited for data analytics in the IoT and review research addressing privacy concerns [10], bandwidth limitations [11], and power/compute limitations [12]. The rest of this paper is organised as follows. section II provides an introduction to preliminary work; the basics of machine learning, deep learning and distributed deep learning required to understand the motivation for federated learning and how it has been applied in the literature. section III introduces

arXiv:2004.11794v1 [cs.LG] 24 Apr 2020


federated learning and outlines the major contributions to federated learning research. Following this, section IV gives an overview of data privacy and methods for preserving the privacy of an individual's data. section V follows with an analysis of privacy preserving methods as applied to federated learning to protect latent data. Finally, section VI discusses the major outstanding challenges and future directions to apply federated learning to IoT applications and section VII presents concluding remarks.

II. PRELIMINARIES

A. Machine learning

Empirical risk minimisation (ERM) is a common formulation of machine learning problems, in which some convex objective function (also known as a loss or cost function) is minimised. The goal is to reduce the distance between the predictions that the model makes and the ground truth labels over all the training data:

\min_{w \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} f_i(w). \qquad (1)

Here, m is the number of examples in the training dataset and w is a vector of parameters which maps inputs to outputs x → y. Common loss functions include the mean squared error (used in linear regression):

L = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2. \qquad (2)

Here, \hat{y} is the prediction (a linear equation such as \hat{y} = w_1 x + w_2) and y is the true label. For binary classification problems (e.g. logistic regression), the log-loss or cross-entropy measures the distance between the predicted probability that the input belongs to a given class (between 0 and 1) and the true class:

L = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right). \qquad (3)

The predicted value is often bounded to fall between 0 and 1 using the sigmoid function:

\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (4)
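To make Equations (2)-(4) concrete, the following NumPy sketch implements the two loss functions and the sigmoid; the function names and the small worked example at the end are illustrative, not taken from the paper.

import numpy as np

def sigmoid(x):
    """Logistic function (Equation 4): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def mse_loss(y_hat, y):
    """Mean squared error (Equation 2), averaged over the m training examples."""
    return np.mean((y_hat - y) ** 2)

def log_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy (Equation 3); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Illustrative usage: two samples with raw model outputs (logits)
logits = np.array([2.0, -1.0])
labels = np.array([1.0, 0.0])
print(log_loss(sigmoid(logits), labels))  # small loss: predictions agree with labels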

Artificial neural networks (ANNs) are also commonly used in machine learning. They are composed of multiple connected units (also known as neurons) organised into layers through which the training data flows [13]. Each unit computes a weighted sum of its input values (including a bias term) composed with a non-linear activation function, g(W^T x + b), and returns the result to the next connected layer. ANNs with many layers have become extremely popular in the field of deep learning due to the excellent prediction accuracy achieved by such models. ANNs also have the ability to extract features from raw data such as images, speech and text, removing the necessity to hand-engineer the input features [13].

Mathematically, a neural network simply composes multiple functions together using weight matrices and biases to transform an input matrix to an output vector or matrix. Passing the data through the network and performing a prediction is known as the forward pass. To train the network, a backward pass operation is specified to compute updates to the weights and biases in order to better approximate the labels associated with the training data. An algorithm known as backpropagation [14] is used to propagate the loss back through each layer of the network. This is achieved by calculating the derivative of the output of each hidden unit with respect to each weight. Combined with the chain rule from calculus, the overall loss calculation can be made from the errors associated with each operation in the network.

Both ERM and ANNs require an optimisation process to minimise the overall loss with respect to each model parameter; this is most commonly achieved via some variant of gradient descent.

B. Gradient descent

Gradient descent is commonly used as the optimisation algorithm for machine learning problems to minimise a cost function. Gradients are first-order derivatives of the loss with respect to each parameter and provide the direction in which to proceed towards the minimum [15]. Each parameter is simultaneously updated by subtracting the gradient multiplied by a scalar learning rate hyperparameter η:

w \leftarrow w - \eta \nabla \frac{1}{m} \sum_{i=1}^{m} f_i(w). \qquad (5)

Training proceeds iteratively by calculating the loss using the new parameters, computing the gradients and updating the parameters again until the minimum is found or the loss changes by only some insignificant factor. When the gradients are calculated over the entire training set in each step, this is known as batch gradient descent [15].

Linear regression and logistic regression models can be optimised quite simply using gradient descent as the cost function for both algorithms is convex in nature. However, the cost functions for some machine learning methods such as ANNs are non-convex. In such cases, gradient descent is not guaranteed to find the global minimum of the cost function.

1) Stochastic gradient descent (SGD): In SGD, a single training sample (sampled randomly) is considered during each step of gradient descent. SGD will not converge but instead "wander" towards the minimum of a cost function [13]. Each step is very fast to compute but the cost will fluctuate up and down during optimisation. On average, however, the cost will decrease over time as expected, but over many more steps of gradient descent.

2) Mini-batch gradient descent: By splitting the training samples into "mini-batches" (e.g. 64 samples per mini-batch) and only obtaining the gradients for a single mini-batch per step of gradient descent, optimisation can proceed much more quickly [15]. A complete pass through the training set over


t mini-batches is termed an "epoch". One epoch in batch gradient descent would involve a single step, whereas mini-batch gradient descent will make t steps per epoch. Mini-batch gradient descent strikes a balance between batch gradient descent, which is slow to make steps, and SGD, which cannot take advantage of vectorised code optimisations [15].
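The three variants differ only in how many samples feed each parameter update. A minimal sketch for a linear model (our own illustrative data and hyperparameters): batch_size=len(X) recovers batch gradient descent, batch_size=1 is SGD, and anything in between is mini-batch gradient descent.

import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=64, epochs=10, seed=0):
    """Mini-batch gradient descent on the MSE loss of a linear model X @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]      # one mini-batch -> one step
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Illustrative usage on synthetic data generated from a known weight vector
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_gd(X, y, batch_size=64, epochs=20))  # close to [1, -2, 0.5]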

3) Faster gradient descent based optimisers: Faster gradient descent based methods exist with the broad aim of smoothing out the steps by effectively reducing the learning rate in dimensions where gradients are large and increasing the learning rate in dimensions where gradients are small. Commonly used gradient descent alternatives include gradient descent with momentum [16], Adagrad [17], RMSprop [18] and adaptive moment estimation (Adam) [19].

C. Deep learning

Deep learning is concerned with machine learning problems based on ANNs comprised of many layers and has been used with great success in the fields of computer vision, speech recognition and translation, as well as many other areas [20]. In these fields, most other machine learning methods have been surpassed by deep learning methods due to the very complex functions they can compute, which can both approximate training labels and generalise well to unseen samples.

1) Application specific architectures: Convolutional neural networks (CNNs) can drastically reduce the number of parameters that need to be learned by a deep neural network by convolving the input features (and the outputs from later hidden layers) with filters [13]. CNNs are commonly used in computer vision tasks where the input dimensions are large.

Convolutions are useful for deep neural networks because they implement parameter sharing; the parameters encoded within the filter can be used in any position within the input space. Also, each value in the output matrix is the result of analysing only a small section of the input (where the filter overlays) and therefore implements sparsity of connections. These properties have the result of vastly reducing the number of parameters required to go from one layer to the next. Convolutions are also very useful for images where the position of the subject is not always in the same place in the image (translation invariance).

Recurrent neural networks (RNNs) are designed such that the outputs of neurons are connected to themselves in a cyclic fashion. They are superior to feed-forward ANNs for processing sequence data such as speech or text and time-series data.

Basic RNNs struggle to capture long-term dependencies [13]. For example, predictions of words at the start of a sentence have little impact on predictions at the end of a sentence. This is because a gradient propagated back from the end of the sentence to the start tends to vanish. The gated recurrent unit (GRU) [21] and the more sophisticated long short-term memory unit (LSTM) [22] make use of memory cells that act as gates, choosing what information to accumulate (remember) and what to forget. The parameters

defining the activation of these gates are all learnable by thenetwork.

D. Distributed deep learning

Deep networks perform best when trained on very large datasets. The deep networks themselves can often incorporate millions if not billions of parameters to express the weights between connections (for example, the AlexNet DNN achieved state-of-the-art performance on the ImageNet dataset in 2012 using 60 million parameters [23]). Both of these factors require large amounts of memory and compute capability. Scaling complex deep networks trained on lots of data requires concurrency across multiple CPUs or, more commonly, GPUs (most often in a local cluster). GPUs are optimised to perform matrix calculations and are well suited for the operations required to compute activations across a DNN. Concurrency can be achieved in a variety of ways, as discussed below.

1) Concurrency: To train a large DNN efficiently across multiple nodes, the calculations required in the forward and backward passes need to be parallelised. One method to achieve this is model parallelism, which distributes collections of neurons among the available compute nodes [24]. Each node then only needs to compute the activations of its own neurons, but must communicate regularly with nodes computing on connected neurons. The calculations on all nodes must occur synchronously and therefore computation proceeds at the speed of the slowest node in the cluster. Another drawback of the model parallelism approach is that the current mini-batch must be copied to all nodes in the compute cluster, further increasing communication costs within the cluster.

A second method resolves some of the issues of excessive communication between nodes by distributing one or more layers on each node. This ensures that each worker node only needs to communicate with one other node (a different node depending on whether the computation is part of the forward pass or the backward pass) [8]. However, this method still requires that data in the mini-batch be copied to all nodes in the cluster.

The final method to achieve parallelism in training a large DNN is termed data parallelism. This method partitions the training dataset and copies the subsets to each compute node in the cluster. Each node computes forward and backward passes over the same model but using mini-batches drawn from its own subset of the training data. The results of the weight updates are then reduced on each iteration via MapReduce or, more commonly today, via the message passing interface (MPI) [8]. Data parallelism is particularly effective as most operations over mini-batches in SGD are independent. Therefore, scaling the problem via sharding the data to many nodes is relatively simple compared to the methods mentioned above. This method solves the issue of training with large amounts of data but requires that the model (and its parameters) fit in memory on each node.

Hybrid parallelism combines two or all three of the concurrency schemes mentioned above to mitigate the drawbacks


associated with each and best support parallelism on the underlying hardware. DistBelief [24] achieves this by distributing the data, network layers, and neurons within the same layer among the available compute nodes, making use of all three concurrency schemes. Similarly, Project Adam [25] employs all three concurrency schemes but much more efficiently than DistBelief (using significantly fewer nodes to achieve high accuracy on the ImageNet¹ 22k dataset).

2) Model consistency: Model consistency refers to the state of a model when trained in a distributed manner [8] - a consistent model should reflect the same parameter values among compute nodes prior to each training iteration (or set of training iterations, sometimes referred to as a communication round). In order to maintain model consistency, individual compute nodes need to write updates to a global parameter server [26]. The parameter server performs some form of aggregation on the updates to synchronise a global model and the parameters (for example, weights in a neural network) are then shared with the individual compute nodes for the next iteration/round of training.

There are several broad methods by which to train, update and share a distributed deep learning model. Synchronous updates occur when the parameter server waits for all compute nodes to return parameters for aggregation. This method provides high consistency between iterations/rounds of training as each node always receives up-to-date parameters, but is not hardware performant due to delays caused by the slowest communicating node. For example, a set of parameters w_t at time t is shared among n_c compute nodes. The compute nodes each perform some number of forward and backward passes over the data available to them and compute the parameter gradients Δw_c. These gradients are communicated to the parameter server, which in turn averages the gradients from all workers and then updates the parameters for time t + 1:

\Delta w_t = \frac{1}{n_c} \sum_{c=1}^{n_c} \Delta w_c, \qquad w_{t+1} = w_t - \eta \Delta w_t. \qquad (6)
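A minimal sketch of one synchronous round as in Equation (6); the hard-coded gradients stand in for what each worker would compute locally and are illustrative only.

import numpy as np

def synchronous_round(w_t, worker_grads, lr):
    """Server side of a synchronous update (Equation 6): wait for all n_c
    workers, average their gradients, then apply a single global step."""
    delta_w = np.mean(worker_grads, axis=0)  # (1/n_c) * sum of the Δw_c
    return w_t - lr * delta_w                # w_{t+1} = w_t - η Δw_t

# Illustrative usage with three workers and a two-parameter model
w = np.array([0.5, -0.2])
grads = [np.array([0.1, 0.0]), np.array([0.3, -0.1]), np.array([0.2, 0.1])]
w = synchronous_round(w, grads, lr=0.1)
print(w)  # [0.48, -0.2]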

Asynchronous updates occur when the parameter server shares the latest parameters without waiting for all nodes to return parameter updates. This reduces model consistency as parameters can be overwritten and become stale due to slow communicating nodes. This method is hardware performant, however, as optimisation can proceed without waiting for all nodes to send parameter updates. The HOGWILD! algorithm [27] takes advantage of sparsity within the parameter update matrix to asynchronously update gradients in shared memory, resulting in faster convergence. Downpour SGD [24] describes asynchronous updates as an additional mechanism to add stochasticity to the optimisation process, resulting in greater prediction performance.

In order to improve consistency using hardware performant asynchronous updates, the concept of parameter 'staleness' has been tackled by several works.

¹ http://www.image-net.org/

The stale synchronous parallel (SSP) model [28] synchronises the global model once a maximum staleness threshold has been reached but still allows workers to compute updates on stale values between global model syncs. The impact of staleness in asynchronous SGD can also be mitigated by adapting the learning rate as a function of the parameter staleness [29], [30]. As an example, a worker pushes an update computed at t = j to the parameter server at t = i. The parameters in the global model at t = i are the most up-to-date available. To prevent a stale parameter update from occurring, a staleness parameter τ_k for the k-th parameter is calculated as τ_k = i − j. The learning rate used in Equation 6 is modified as:

\eta_k = \begin{cases} \eta / \tau_k & \text{if } \tau_k \neq 0 \\ \eta & \text{otherwise.} \end{cases} \qquad (7)
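In code, the staleness-dependent learning rate of Equation (7) is a one-line adjustment (a sketch; the variable names are ours):

def staleness_adjusted_lr(eta, i, j):
    """Equation 7: scale the base learning rate eta by parameter staleness.
    i is the current global step; j is the step the worker last read from."""
    tau = i - j
    return eta / tau if tau != 0 else eta

print(staleness_adjusted_lr(0.1, i=10, j=8))  # stale by 2 steps -> 0.05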

3) Centralised vs decentralised learning: Centralised distribution of the model updates requires a parameter server (which may be a single machine or sharded across multiple machines as in [24]). The global model tracks the averaged parameters aggregated from all the compute nodes that perform training (see Equation 6). The downside to this distribution method is the high communication cost between compute nodes and the parameter server. Multiple shards can relieve this bottleneck to some extent, such that different workers read and write parameter updates to specific shards [24], [25].

Heterogeneity of worker resources is handled well in centralised distribution models. Distributed compute nodes introduce varying amounts of latency (especially when distributed geographically as in [31]), yet training can proceed via asynchronous, or more efficiently, stale-synchronous methods [32]. Heterogeneity is an inherent feature of federated learning.

Decentralised distribution of DNN training does not rely on a parameter server to aggregate updates from workers but instead allows workers to communicate with one another, resulting in each worker performing aggregation on the parameters it receives. Gossip algorithms that share updates between a fixed number of neighbouring nodes have been applied to distributed SGD [33]-[35] in order to efficiently communicate/aggregate updates between all nodes in an exponential fashion, similar to how disease spreads during an epidemic.

Communication can be avoided completely during training, resulting in many individual models represented by very different parameters. These models can be combined (as an ensemble [20]); however, averaging the predictions from many models can slow down inference on new data. To tackle this, a process known as knowledge distillation can be used to train a single DNN (known as the mimic network) to emulate the predictions of an ensemble model [36]-[38]. Unlabelled data is passed through the ensemble network to obtain labels on which the mimic network can be trained.


TABLE I
A SUMMARY OF THE MAJOR CONTRIBUTIONS TO FEDERATED LEARNING RESEARCH

Ref  | Application(s)                               | Year | Major contribution
[39] | Web analytics                                | 2016 | First description of federated optimisation and its application to a convex problem (logistic regression)
[9]  | Computer vision, NLP                         | 2016 | Description of the federated averaging algorithm to improve the performance of the global model and reduce communication between the clients and server
[40] | Computer vision, NLP                         | 2016 | Methods for compressing weight updates and reducing the bandwidth required to perform federated learning
[41] | Computer vision, Activity recognition        | 2017 | Application of multi-task learning in a federated setting and discussion of system challenges relevant to using federated learning on resource-constrained devices
[42] | Computer vision, NLP                         | 2018 | A demonstration of poisoning the shared global model in a federated learning setting
[43] | Web analytics, Computer vision, Security     | 2018 | A method to recognise adversarial clients and combat model poisoning in a federated learning setting
[44] | NLP                                          | 2018 | Application of federated learning in a commercial setting (next-word keyboard prediction in Android Gboard)
[45] | Speech recognition                           | 2018 | Application of per-coordinate averaging (based on Adam) to federated learning to achieve faster convergence (in fewer communication rounds)
[46] | Healthcare                                   | 2018 | Applied federated learning to a healthcare application including further training after federated learning on client data (transfer learning)
[47] | Computer vision                              | 2018 | A method of federated learning selecting clients with faster communication/greater resources to participate in each communication round, achieving faster convergence
[12] | Web analytics, Computer vision, Smart energy | 2018 | Description of an adaptive federated learning method suitable for deployment on resource-constrained devices to optimally learn a shared model while maintaining a fixed energy budget
[48] | Computer vision, Speech recognition          | 2018 | Characterisation of how non-IID data reduces the model performance of federated learning and a method for improving model performance

III. FEDERATED LEARNING

A. Overview

Federated learning makes use of data parallelism but, rather than randomly partitioning a centralised dataset to many compute nodes, training occurs in the user domain on distributed data owned by the individual users (often referred to as clients) [9]. The consequence of this is that user data is never shared directly with a third party orchestrating the training procedure. This greatly benefits users where the data might be considered sensitive. Where data needs to be observed (for example, during the training operation), processing is handled on the device where the data resides (for example, a smartphone). Once a round of training is completed on the device, the model parameters are communicated to an aggregating server, such as a parameter server provided by a third party. Although the training data itself is never disclosed to the third party, it is a reasonable concern that something about an individual's training data might be inferred from the parameter updates; this is discussed further in section V.

Federated learning is vastly more distributed than traditional approaches for training machine learning models via data parallelism. Some of the key differences are [9]:

1) Many contributing clients - federated learning needs to be scalable to many millions of clients.

2) Varying quantity of data owned by each user - some clients may train on only a few samples; others may have thousands.

3) Often very different data distributions between users - user data is highly personal to individuals and therefore the data held by each client is non-IID (not independent and identically distributed).

4) High latency between clients and aggregating service - updates are commonly communicated via the internet, introducing significant latency between communication rounds.

5) Unstable communication between clients and aggregating service - client devices are likely to become unavailable during training due to their mobility, battery life, or other reasons.

These distinguishing features of federated learning pose challenges above and beyond standard distributed learning.

Although this review focuses on deep learning in particular, many other ML algorithms can be trained via federated learning. Any ML algorithm designed to minimise an objective


function of the form shown in Equation 1 is well suited to training via many clients (for example, linear regression and logistic regression). Some non-gradient based algorithms can also be trained in this way, such as principal component analysis and k-means clustering [49].

Federated optimisation was first suggested as a new setting for vastly and unevenly distributed machine learning by Konecny et al. [39] in 2016. In their work, the authors first describe the nature of the federated setting (non-IID data, varying quantity of data per client, etc.). Additionally, the authors test a simple application of distributed gradient descent against a federated modification of SVRG (a variance reducing variant of SGD [50]) over distributed data. Federated SVRG calculates gradients and performs parameter updates on each of K nodes over the available data on each node and obtains a weighted average of the parameters from all clients. The performance of these algorithms is verified on a logistic regression language model using Google+ data to determine whether a post will receive at least one comment. As logistic regression is a convex problem, the algorithms can be benchmarked against a known optimum. Federated SVRG is shown to outperform gradient descent by converging to the optimum within 30 rounds of communication.

Another approach to federating optimisation over many nodes is proposed by Smith et al. [41]. In this work, each client's data distribution is modelled as a single task as part of a multi-task learning objective. In multi-task learning, all tasks are assumed to be similar and therefore each task can benefit from learning derived from all the other tasks. On three different problems (based on human activity recognition and computer vision), the federated multi-task learning setting outperforms a centralised global setting and a strictly localised setting, with lower average prediction errors. As part of this work [41], the authors also show that federated multi-task learning is robust to nodes temporarily dropping out during learning and when communication is reduced (by simulating more iterations on the client per communication round).

Federated learning (as described in [9]) simplifies the federated SVRG approach in [39] by modifying SGD for the federated setting. McMahan et al. [9] provide two distributed SGD scenarios for their experiments: FederatedSGD and FederatedAveraging. FederatedSGD performs a single step of gradient descent on all the clients and averages the gradients on the server. The FederatedAveraging algorithm (shown in algorithm 1) randomly selects a fraction of the clients to participate in each round of training. Each client k computes the gradients on the current state of the global model w_t and updates the parameters w^k_{t+1} in the standard fashion:

\forall k, \quad w^k_{t+1} \leftarrow w_t - \eta \nabla f(w_t). \qquad (8)

All clients communicate their updates to the aggregating server, which then calculates a weighted average of the contributions from each client to update the global model:

Algorithm 1 FEDERATEDAVERAGING algorithm. C is the fraction of clients selected to participate in each communication round. The K clients are indexed by k; B is the local mini-batch size, P_k is the dataset available to client k, E is the number of local epochs, and η is the learning rate.

1:  procedure FEDERATEDAVERAGING            ▷ Run on server
2:    Initialise w_0
3:    for each round t = 1, 2, ... do
4:      m ← max(C · K, 1)
5:      S_t ← (random set of m clients)
6:      for each client k ∈ S_t do          ▷ In parallel
7:        w^k_{t+1} ← CLIENTUPDATE(k, w_t)
8:      end for
9:      w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w^k_{t+1}
10:   end for
11: end procedure

12: procedure CLIENTUPDATE(k, w)            ▷ Run on client k
13:   B ← (split P_k into batches of size B)
14:   for each local epoch i from 1 to E do
15:     for batch b ∈ B do
16:       w ← w − η∇L(w; b)
17:     end for
18:   end for
19:   return w to server
20: end procedure

w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w^k_{t+1}. \qquad (9)

Here, n_k/n is the fraction of data available to the client compared to the data available to all participating clients. Clients can perform one or multiple steps of gradient descent before sending weight updates, as orchestrated by the federated algorithm. A diagram describing how federated learning proceeds in the FederatedAveraging scenario is provided in Figure 1.
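The following Python sketch condenses Algorithm 1 and Equation (9) into a runnable simulation; the local model (linear regression trained with mini-batch SGD), the synthetic data, and all hyperparameter values are our own illustrative choices, not those of [9].

import numpy as np

rng = np.random.default_rng(42)

def client_update(w, data, lr=0.1, epochs=5, batch_size=10):
    """CLIENTUPDATE from Algorithm 1: E local epochs of mini-batch SGD.
    A linear model with MSE loss stands in for an arbitrary model."""
    X, y = data
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for s in range(0, len(X), batch_size):
            b = idx[s:s + batch_size]
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = w - lr * grad
    return w

def federated_averaging(clients, dim, rounds=20, C=0.5):
    """Server loop of Algorithm 1: each round, sample a fraction C of the K
    clients, train locally, then take the data-weighted average (Equation 9)."""
    w = np.zeros(dim)
    K = len(clients)
    for _ in range(rounds):
        m = max(int(C * K), 1)
        selected = rng.choice(K, size=m, replace=False)
        updates = [client_update(w.copy(), clients[k]) for k in selected]
        sizes = [len(clients[k][0]) for k in selected]
        n = sum(sizes)
        w = sum((nk / n) * wk for nk, wk in zip(sizes, updates))
    return w

# Illustrative usage: 10 clients, each holding a private shard of synthetic data
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(10):
    X = rng.normal(size=(int(rng.integers(20, 100)), 2))
    clients.append((X, X @ true_w))
print(federated_averaging(clients, dim=2))  # approaches [1, -2]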

For supervised problems, user data needs to be labelled by the user to be useful for training. This is demonstrated in [44], where a language model (an LSTM) is trained via many clients on words typed on a mobile keyboard to predict the next word. This data is clearly highly sensitive and should not be sent to a central server directly, so it benefits greatly from training via federated learning. The training data for this model is automatically labelled when the user types the next word. In cases where data is stored locally and already labelled, such as medical health records, privacy is of great concern and even sharing of data between hospitals may be prohibited [51]. Federated learning can be applied in these settings to produce a shared global model that is more accurate than a model trained by each hospital separately. In [46], electronic health records from 58 hospitals are used to train a simple neural network in a federated setting to predict patient mortality. The authors found that partially training the network using federated learning, followed by freezing the first layer and training only on the data available to each client, resulted in better performing models for each hospital.


Fig. 1. Schematic diagram showing how communication proceeds between the aggregating server and individual clients according to the FederatedAveraging protocol. This procedure is iterated until the model converges or reaches some desired target metric (e.g. elapsed time, accuracy).

Machine learning algorithms are often trained under the assumption that the data is independent and identically distributed (IID). This assumption is invalid in federated learning, as the training data is decentralised with significantly different distributions and numbers of samples between participating clients. Training using non-IID data has been shown to converge much more slowly than IID data in a federated learning setting using the MNIST² dataset (for handwritten digit recognition), distributed between clients after having been sorted by the target label [9]. The overall accuracy achieved by an ANN trained via federated learning can be significantly reduced when trained on highly skewed non-IID data [48]. Yue et al. [48] show that accuracy can be improved by sharing a small subset of non-private data between all the clients in order to reduce the variance between weight updates of the clients involved in each communication round.

Due to the distributed nature of client participation required for federated learning, the protocol is susceptible to adversarial attacks. Bagdasaryan et al. [42] show a method for poisoning the global model with an adversary acting as a client in the federated learning setting. The adversary constructs an update such that it survives the averaging procedure and replaces the global model. In this way, an adversary can poison the model to return predictions specified by the attacker given certain input features. Fung et al. [43] describe a method to defend against sybil-based adversarial attacks by measuring the similarity between client contributions during model averaging and filtering attackers' updates out.

² http://yann.lecun.com/exdb/mnist/

Much of the federated learning research to date focusses on toy datasets designed for experimentation and benchmarking (CIFAR, MNIST). However, the use of real-world datasets is beginning to appear. For example, there are efforts to train shared models on distributed hospital-owned datasets [46], where sharing of the data itself is a privacy risk.

B. Communication-efficient federated learning

As highlighted above, distributed learning, and federated learning in particular, suffer from high latency between communication rounds. Additionally, given a sufficiently large DNN, the number of parameters that need to be communicated in each communication round from possibly many thousands of clients becomes problematic in relation to data transmission and traffic arriving at the aggregating server. There are several approaches to mitigating these issues, as discussed in this subsection.

The simplest method to reduce bandwidth use in communicating model updates is simply to communicate less often. In [9], researchers experimented with the mini-batch size and number of epochs while training a CNN on the MNIST dataset, as well as an LSTM trained on the complete works of William Shakespeare³ to predict the next text character after some input of characters. Using FederatedAveraging, the authors [9] showed that increasing computation on the client between communication rounds significantly reduced the number of communication rounds required to converge to a threshold test accuracy, compared to a single epoch trained on all available data on the client (a single iteration of gradient descent).

³ https://www.gutenberg.org/ebooks/100


The greatest reduction in communication rounds was achieved using a mini-batch size of 10 and 20 epochs on the client using the CNN model (34.8x), and a mini-batch size of 10 and 5 epochs using the LSTM model (95.3x). As this method completely eliminates many of the communication rounds, it should be preferred over (or combined with) the compression methods discussed next.

As the network connection used to communicate between the clients and the aggregating server is generally asymmetric (download speed is generally significantly faster than upload speed), downloading the updated model from the aggregating server is less of a concern than uploading the updates from the clients. Despite this, compression methods exist to reduce the size of deep learning models themselves [52], [53].

The compression of the parameter updates on the client prior to transmission is important to reduce the size of the overall update, but should still maintain a stable statistical mean when updates from all clients are averaged in the global shared model. The following compression methods are not specific to federated learning but have been experimented with in several works related to federated learning.

The individual weights that represent the parameters of an ANN are generally encoded using a 32-bit floating point number. Multiple works explore the effect of lossy compression of the weights (or gradients) to 16-bit [54], 8-bit [55], or even 1-bit [56], employing stochastic rounding to maintain the expected value. The results of these experiments show that as long as the quantisation error is carried forward between mini-batch computations, the overall accuracy of the model is not significantly impacted.
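A sketch of 8-bit quantisation with stochastic rounding (our own minimal version; the works above additionally carry the quantisation error forward between mini-batches as error feedback, which is omitted here):

import numpy as np

def quantise_stochastic(w, levels=256, rng=None):
    """Quantise 32-bit weights to 8 bits with stochastic rounding: rounding
    up with probability equal to the fractional remainder keeps the
    expected value of each weight equal to its original value."""
    rng = rng or np.random.default_rng()
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid divide-by-zero
    x = (w - lo) / scale                     # map onto [0, levels-1]
    floor = np.floor(x)
    q = floor + (rng.random(w.shape) < (x - floor))
    return q.astype(np.uint8), lo, scale     # 8 bits per weight on the wire

def dequantise(q, lo, scale):
    return lo + q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=5).astype(np.float32)
q, lo, scale = quantise_stochastic(w)
print(w, dequantise(q, lo, scale))           # close to the originals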

Another method to compress the weight matrix itself is to convert the matrix from a dense representation to a sparse one. This can be achieved by applying a random mask to the matrix and only communicating the resulting non-zero values, along with the seed used to generate the random mask [40]. Using this approach combined with FederatedAveraging [9] on the CIFAR-10⁴ image recognition dataset, it has been shown [40] that neither the rate of convergence nor the overall test accuracy is significantly impacted, even when only 6.25% of the weights are transmitted during each communication round. Also described in [40] is a matrix factorisation method whereby the weight matrix is approximated by the product of a randomly generated matrix A and another matrix B that is optimised during training. Only the matrix B (plus the random seed to generate A) needs to be transmitted in each communication round. The authors show, however, that this method performs significantly worse than the random mask method as the compression ratio is increased.
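A sketch of the random-mask idea (illustrative function names; the key point is that only the kept values plus an integer seed cross the network, since the server can regenerate the identical mask from the seed):

import numpy as np

def mask_compress(update, keep=0.0625, seed=0):
    """Keep a random ~6.25% of the update; transmit the values and the seed."""
    mask = np.random.default_rng(seed).random(update.shape) < keep
    return update[mask], seed

def mask_decompress(values, shape, keep=0.0625, seed=0):
    """Server side: rebuild the same mask from the seed and scatter values."""
    mask = np.random.default_rng(seed).random(shape) < keep
    sparse = np.zeros(shape, dtype=values.dtype)
    sparse[mask] = values
    return sparse

update = np.random.default_rng(1).normal(size=(4, 4))
values, seed = mask_compress(update, seed=7)
print(mask_decompress(values, update.shape, seed=seed))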

Shokri & Shmatikov [11] propose an alternative sparsification method implemented in their "Selective SGD" (SSGD) procedure. This method transfers only a fraction of randomly selected weights to each client from the global shared model and only shares a fraction of updated weights back to the aggregating service. The updated weights selected to be communicated are determined by either weight size (largest unsigned magnitude updates) or a random subset of values above a certain threshold.

⁴ https://www.cs.toronto.edu/~kriz/cifar.html

The authors [11] show that a CNN trained on the MNIST and Street View House Numbers (SVHN)⁵ datasets can achieve similar levels of accuracy sharing only 10% of the updated weights, and only a slight drop (1-2%) in accuracy by sharing only 1% of the updated weights. The paper also shows that the greater the number of users participating in SSGD, the greater the overall accuracy.

Hardy et al. [57] take a similar approach to selective SGD but select the largest gradients in each layer, rather than over the whole weight matrix, to better reflect changes throughout the DNN. Their algorithm, "AdaComp", also uses an adaptive learning rate per parameter based on the staleness of the parameter to improve overall test accuracy on the MNIST dataset using a CNN. Most recently, Lin et al. [58] apply gradient sparsification along with gradient clipping and momentum correction during training to reduce communication bandwidth by 270x and 600x without a significant drop in prediction performance on various ML problems in computer vision, language modelling and speech recognition.
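A sketch of magnitude-based selection in the spirit of SSGD [11] and AdaComp [57] (our own minimal top-k version, not either paper's exact procedure):

import numpy as np

def top_k_select(update, frac=0.01):
    """Return the indices and values of the largest-magnitude fraction of an
    update; only these (plus their indices) would be transmitted."""
    flat = update.ravel()
    k = max(1, int(frac * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # unsorted top-k by |value|
    return idx, flat[idx]

update = np.random.default_rng(0).normal(size=(100, 100))
idx, vals = top_k_select(update, frac=0.01)       # 100 of 10,000 entries kept
print(idx.size, np.abs(vals).min())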

Leroy et al. [45] experiment with using moment-based averaging inspired by the Adam optimisation procedure, in place of a standard weighted global average in FederatedAveraging [9]. The authors train a CNN to detect a "wake word" (similar to "Hey Siri" to initialise the Siri program on iOS devices). The moment-based averaging method achieves a target recall of almost 95% within 100 rounds of communication over 1400 clients, compared to only 30% using global averaging.

By selecting clients based on client resource constraints in a mobile edge computing environment, Nishio & Yonetani [47] show that federated learning can be sped up considerably. As federated learning proceeds in a synchronous fashion, the slowest communicating node is a limiting factor in the speed at which training can progress. In this work [47], target accuracies on the CIFAR-10 and Fashion-MNIST⁶ datasets are achieved in significantly less time than by using the FederatedAveraging algorithm in [9]. In a similar vein, Wang et al. [12] aim to take into account client resources during federated learning training. In this work, an algorithm is designed to control the tradeoff between local gradient updates and global averaging in order to optimally minimise the loss function under a fixed resource budget.

IV. PRIVACY PRESERVATION

Data collection for the purpose of learning something about a population (for example, in machine learning, to discover a function for mapping the data to target labels) can expose sensitive information about individual users. In machine learning, this is often not the primary concern of the developer or researcher creating the model, yet it is extremely important for circumstances where personally sensitive data is collected and disseminated in some form (e.g. via a trained model). Privacy has become even more important in the age of big data (data which is characterised by its large volume, variety and velocity [59]).

⁵ http://ufldl.stanford.edu/housenumbers/
⁶ https://www.kaggle.com/zalando-research/fashionmnist


As businesses gather increasing amounts of data about users, the risk of privacy breaches via controlled data releases grows.

This review focuses on the protection of an individual's privacy via controlled data releases (such as from personal data used to train a machine learning model) and does not consider privacy breaches via hacking and theft, which is a separate issue related to data security.

Privacy is upheld as a human right in many countries via Article 12 of the Universal Declaration of Human Rights [60], Article 17 of the International Covenant on Civil and Political Rights [61], and Article 8 of the European Convention on Human Rights [62]. In Europe, rigorous legislation with respect to data protection via the General Data Protection Regulation [63] safeguards data privacy, such that users are given the facts about how and what data is collected about them, how it is used, and by whom. Despite these rights and regulations, data privacy is difficult to maintain and breaches of privacy via controlled data releases occur often.

Privacy can be preserved in a number of ways, yet it is important to maintain a balance between the level of privacy and the utility of the data (along with some consideration for the computational complexity required to preserve privacy). A privacy mechanism augments the original data in order to prevent a breach of personal privacy (i.e. an individual should not be able to be recognised in the data). For example, a privacy mechanism might use noise to augment the result of a query on the data [64]. Adding too much noise to a result might render it meaningless, while adding too little noise might leak sensitive information. The privacy/utility tradeoff is a primary concern of the privacy mechanisms discussed in the next subsection.

A. Privacy preserving methods

The privacy preserving methods discussed in this section can be described as either suppressive or perturbative [65]. Suppressive methods include removal of attributes in the data, restricting queries via the privacy mechanism, aggregation/generalisation of data attributes and returning a sampled version of the original data. Perturbative methods include noise addition to the result of a query on the data or rounding of values in the dataset.

1) Anonymisation: Anonymisation or de-identification is achieved by removing any information that might identify an individual within a dataset. Ad-hoc anonymisation might reasonably remove names, addresses, phone numbers, etc., and replace each user's record(s) with a pseudonym value to act as an identifier, under the assumption that individuals cannot be identified within the altered dataset. However, this leaves the data open to privacy attacks known as linkage attacks [65]. In the presence of auxiliary information, linkage attacks allow an adversary to re-identify individuals in the otherwise anonymous dataset.

Several famous examples of such linkage attacks exist. An MIT graduate, Latanya Sweeney, purchased voter registration records for the town of Cambridge, Massachusetts and was

able to use combinations of attributes (ZIP code, gender and date of birth), known as a quasi-identifier, to identify the then governor of Massachusetts, William Weld. When combined with state-released anonymised medical records, Sweeney was able to identify his medical information from the data release [66]. As part of a machine learning competition known as the Netflix Prize⁷ launched in 2006, Netflix released a random sample of pseudo-anonymised movie rating data. Narayanan & Shmatikov [67] were able to show that using relatively few publicly published ratings on IMDb⁸, all the ratings in the Netflix data for the same user could be revealed. Lastly, in 2014, celebrities in New York were able to be tracked by combining taxi route data released via freedom of information law, a de-hashing of the taxi license numbers (which were based on MD5) and geo-tagged photos of the celebrities entering/exiting certain taxis [68].

k-anonymity was proposed by Sweeney [69] to tackle the

challenge of linkage attacks on anonymised datasets. Using k-anonymity, data is suppressed such that k−1 or more individuals possess the same attributes used to create a quasi-identifier. Therefore, an identifiable record in an auxiliary dataset would link to multiple records in the anonymous dataset. However, k-anonymity cannot defend against linkage attacks where a sensitive attribute is shared among a group of individuals with the same quasi-identifier. l-diversity builds on k-anonymity to ensure that there is diversity within any group of individuals sharing the same quasi-identifier [70]. t-closeness builds on both these methods to preserve the distribution of sensitive attributes among any group of individuals sharing the same quasi-identifier [71]. All these methods suffer, however, when an adversary possesses some knowledge about the sensitive attribute. Research related to improving k-anonymity based methods has mostly been abandoned in the literature in preference for methods that offer more rigorous privacy guarantees (such as differential privacy).
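As an illustration, whether a release satisfies k-anonymity can be checked by grouping on the quasi-identifier and inspecting the smallest group size (a sketch with made-up records and column names):

import pandas as pd

def is_k_anonymous(df, quasi_identifier, k):
    """True if every combination of quasi-identifier values appears in at
    least k records, so a linkage attack matches k or more individuals."""
    return df.groupby(quasi_identifier).size().min() >= k

records = pd.DataFrame({
    "zip":    ["02138", "02138", "02139", "02139"],
    "gender": ["F", "F", "M", "M"],
    "dob":    ["1960", "1960", "1971", "1971"],
})
print(is_k_anonymous(records, ["zip", "gender", "dob"], k=2))  # True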

2) Encryption: Anonymisation presents several difficult challenges in order to provide statistics about data without disclosing sensitive information. Encrypting data provides better privacy protection, but the ability to perform useful statistical analysis on encrypted data requires specialist methods. Homomorphic encryption [72] allows for processing of data in its encrypted form. Earlier efforts (termed "Somewhat Homomorphic Encryption") allowed for simple addition and multiplication operations on encrypted data [73], but were shortly followed by Fully Homomorphic Encryption, allowing for any arbitrary function to be applied to data in ciphertext form to yield an encrypted result [72].

Despite the apparent advantages of homomorphic encryption in providing privacy to individuals over their data whilst allowing a third party to perform analytics on it, the computational overhead required to perform such operations is very large [74], [75]. IBM's homomorphic library implementation⁹ runs some 50 million times slower than performing calculations on plaintext data [76].

⁷ https://www.netflixprize.com/index.html
⁸ https://www.imdb.com/


Due to this computational overhead, applying homomorphic encryption to training on large-scale machine learning data is currently impractical [77]. Several projects make use of homomorphic encryption for inference on encrypted private data [74], [78].

Secure multi-party computation (SMC) [79] can also be adopted to compute a function on private data owned by many parties, such that no party learns anything about others' data - only the output of the function. Many SMC protocols are based on Shamir's secret sharing [80], which splits data into n pieces in such a way that only k pieces are required to reconstruct the original data (k − 1 pieces reveal nothing about the original data). For example, a value x is shared with multiple servers (as x_A, x_B, ...) via an SMC protocol such that the data can only be reconstructed if the shared pieces on k servers are known [81]. Various protocols exist to compute some function over the data held on the different servers via rounds of communication; however, the servers involved are assumed to be trustworthy.

3) Differential privacy: Differential privacy provides an elegant and rigorous mathematical measure of the level of privacy afforded by a privacy preserving mechanism. A differentially private privacy preserving mechanism acting on very similar datasets will return statistically indistinguishable results. More formally: given some privacy mechanism M that maps inputs from domain D to outputs in range R, it is "almost" equally likely (by some multiplicative factor ε) for any subset of outputs S ⊆ R to occur, regardless of the presence or absence of a single individual in two neighbouring datasets d and d′ drawn from D (differing by a single individual) [64]:

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S].    (10)

Here, d and d′ are interchangeable with the same outcome. This privacy guarantee protects individuals from being identified within the dataset, as the result from the mechanism should be essentially the same regardless of whether the individual appeared in the original dataset or not. Differential privacy is an example of a perturbative privacy preserving method, as the privacy guarantee is achieved by adding noise to the true output. This noise is commonly drawn from a Laplacian distribution [64] but can also be drawn from an exponential distribution [82] or via the novel staircase mechanism [83], which provides greater utility than Laplacian noise for the same ε. The above description of differential privacy is often termed ε-differential privacy or strict differential privacy.

The amount of noise required to satisfy ε-differential privacy is governed by ε and by the sensitivity of the statistic function Q, defined as [64]:

∆Q = max(||Q(d)−Q(d′)||1). (11)

This maximum is evaluated over all neighbouring datasets in the set D differing by a single individual. The output of the mechanism using noise drawn from the Laplacian distribution is then:


M(d) = Q(d) + Laplace(0, ∆Q/ε).    (12)
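
As a concrete sketch of Equation 12, the snippet below perturbs a counting query, whose L1 sensitivity is 1 because adding or removing one individual changes the count by at most one; the data and the choice of ε are arbitrary.

```python
# A minimal sketch of the Laplace mechanism (Equation 12) applied
# to a counting query with L1 sensitivity 1.
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = np.array([23, 35, 45, 52, 61, 29, 41])
true_count = np.sum(ages > 40)                             # 4
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy_count)  # e.g. 4.87 -- differs on every run
```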

A relaxed version of differential privacy known as (ε, δ)-differential privacy [84] provides greater flexibility in designing privacy preserving mechanisms and greater resistance to attacks making use of auxiliary information [82]:

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ.    (13)

Whereas ε-differential privacy provides a privacy guarantee even for results with extremely small probabilities, the δ in (ε, δ)-differential privacy accounts for the small probability that the privacy guarantee of ordinary ε-differential privacy is broken.

The Gaussian mechanism is commonly used to add noise to satisfy (ε, δ)-differential privacy [85], but instead of the L1 norm used in Equation 11, the noise is scaled to the L2 norm:

∆2Q = max(||Q(d)−Q(d′)||2). (14)

The following mechanism then satisfies (ε, δ)-differential privacy (given ε, δ ∈ (0, 1)):

M(d) = Q(d) + (∆2Q/ε) · N(0, 2 ln(1.25/δ)).    (15)
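
Equation 15 can be sketched in the same fashion; here the noise scale σ is calibrated by both ε and δ as well as by the L2 sensitivity (the values chosen below are illustrative).

```python
# A minimal sketch of the Gaussian mechanism (Equation 15):
# sigma = (L2 sensitivity / epsilon) * sqrt(2 ln(1.25 / delta)).
import numpy as np

rng = np.random.default_rng()

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta):
    sigma = (l2_sensitivity / epsilon) * np.sqrt(2 * np.log(1.25 / delta))
    return true_value + rng.normal(loc=0.0, scale=sigma)

print(gaussian_mechanism(4.0, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```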

ε is additive over multiple queries [82] and therefore an ε-budget should be designed to protect private data that is queried multiple times. Practically, this means that any differential-privacy-based system must keep track of who queries what and how often, to ensure that some predefined ε-budget is not surpassed. In a machine learning setting, a method of accounting for the accumulated privacy loss over training iterations [10] needs to be employed to maintain an ε-budget.
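
A naive additive accountant along these lines can be sketched in a few lines; real systems such as the moments accountant of [10] track privacy loss far more tightly, so the class below is purely illustrative.

```python
# A minimal sketch of naive (additive) epsilon-budget tracking:
# each query spends part of a fixed budget and is refused once
# the budget would be exceeded.
class EpsilonBudget:
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            return False          # refuse: the budget would be breached
        self.remaining -= epsilon
        return True

budget = EpsilonBudget(total_epsilon=1.0)
for query in range(5):
    allowed = budget.spend(0.3)
    print(f"query {query}: {'answered' if allowed else 'refused'}")
# queries 0-2 are answered (0.9 spent); queries 3-4 are refused
```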

Accumulated knowledge as described above is one of the weaknesses of differential privacy in keeping sensitive data private [82]. Another is collusion. If multiple users collude in querying the data (sharing the results of queries with one another), the ε-budget for any single user might be breached. Finally, suppose an ε-budget is assigned for each individual query; a user making queries on correlated data will use only the budget for each query, yet may be able to gain more information due to the fact that two quantities are correlated (e.g. income and rent). Clearly, large ε (or large ε-budgets) introduce greater risk of privacy breaches than small ones, but selecting an appropriate ε is a non-trivial issue. Lee and Clifton [86] discuss the means by which ε might be selected for a given problem, but identify that, in order to do so, the dataset and the queries on the dataset should be known ahead of time.

Noise addition can be applied in two separate scenarios. Given a trusted data curator, noise can be added to queries on a static dataset, introducing only minimal noise per query. This can be considered a global privacy setting. Conversely, in a local privacy setting, no such trusted curator exists. Local differential privacy applies when noise is added to each sample before collection/aggregation. For example, the randomised response technique [87] allows participants to answer a question truthfully or randomly based on the flip of a coin. Each participant therefore has plausible deniability for their answer, yet statistics can still be estimated on the population given enough data (and a fair coin flip). Each individual sample is extremely noisy in the local case, due to the high vulnerability of a single sample being leaked; however, aggregated in volume, the noise can be filtered out to an extent to reveal population statistics. Federated learning is another example of where local differential privacy is useful [88]. Adding noise to the updates during training rounds on local user data, prior to aggregation by an untrusted parameter server, provides greater privacy to the user and their contributions to a global model (discussed further in Section V).
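
The estimator underlying randomised response is easy to demonstrate in simulation: because the noise process is known, the observed answer rate can be inverted to recover the population rate. The sketch below assumes a truthful-answer probability p of 0.5 and a hypothetical 30% population rate.

```python
# A minimal sketch of randomised response [87]: answer truthfully
# with probability p, otherwise answer uniformly at random, then
# invert the known noise process to estimate the population rate.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # probability of answering truthfully
true_answers = rng.random(100_000) < 0.3   # 30% of the population holds the trait

truthful = rng.random(true_answers.size) < p
random_yes = rng.random(true_answers.size) < 0.5
reported = np.where(truthful, true_answers, random_yes)

# E[reported] = p * rate + (1 - p) * 0.5, so invert:
estimated_rate = (reported.mean() - (1 - p) * 0.5) / p
print(f"estimated rate: {estimated_rate:.3f}")  # close to 0.30
```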

Limited examples of practical applications using differential privacy exist outside of academia. Apple implemented differential privacy in its iOS 10 operating system for the iPhone [89] in order to collect statistics on emoji suggestions and Safari crash reports [90]. Google also collects usage statistics for the Chrome internet browser using differential privacy, via multiple rounds of the randomised response technique coupled with a Bloom filter to encode the domain names of sites a user has visited [91]. Both these applications use local differential privacy to protect the individual’s privacy, but rely on large numbers of participating users in order to determine accurate overall statistics.

Future research and applications of differential privacy are likely to focus on improving utility whilst retaining good privacy guarantees, in order to encourage greater adoption by the IT industry.

V. PRIVACY PRESERVATION IN FEDERATED LEARNING

Federated learning already increases the level of privacy afforded to an individual over traditional machine learning on a static dataset. It mitigates the storage of sensitive personal data by a third party and prevents a third party from performing learning tasks on data for which the individual had not initially given permission. Additionally, inference does not require that further sensitive data be sent to a third party, as the global model is available to the individual on their own private device [11]. Despite these privacy improvements, the weight/gradient updates uploaded by individuals may reveal information about the user’s data, especially if certain weights in the weight matrix are sensitive to specific features or values in the individual’s data (for example, specific words in a language prediction model [92]). These updates are available to any client participating in the federated learning procedure as well as to the aggregating server.

Bonawitz et al. [93] show that devices participating in federated learning can also act as parties involved in SMC to protect the privacy of all users’ updates. In their “Secure Aggregation” protocol, the aggregating server only learns about client updates in aggregate. Similarly, the ∝MDL protocol described in [95] uses SMC but also encrypts the gradients

on the client using homomorphic encryption. The summation of the encrypted gradients over all participating clients gives an encrypted global gradient; however, this summation result can only be decrypted once a threshold number of clients have shared their gradients. Therefore, again, the server can only learn about client updates in aggregate, preserving the privacy of individual contributions.
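
The intuition behind such schemes can be shown with a toy version of the pairwise-masking trick that underpins Secure Aggregation [93] (the full protocol additionally handles client dropouts and key agreement, which are omitted here): each pair of clients agrees a random mask that one adds and the other subtracts, so every individual update is hidden while the masks cancel exactly in the sum.

```python
# A toy sketch of pairwise masking: the server sees only masked
# updates, yet their sum equals the sum of the true updates.
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim = 4, 3
updates = rng.normal(size=(n_clients, dim))    # each client's true update

masked = updates.copy()
for i in range(n_clients):
    for j in range(i + 1, n_clients):
        mask = rng.normal(size=dim)            # shared secret between clients i and j
        masked[i] += mask                      # client i adds the mask
        masked[j] -= mask                      # client j subtracts it

# Individual rows of `masked` reveal nothing useful, but the sum is exact.
assert np.allclose(masked.sum(axis=0), updates.sum(axis=0))
```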

Researchers at Google have recently described the high-level design of a production-ready federated learning framework [96] based on Tensorflow10. This framework includes Secure Aggregation [93] as an option during training.

10 https://www.tensorflow.org/

Applying SMC to federated learning suffers from increased communication and greater computational complexity in the aggregation process (both for the client and the server). Additionally, the fully trained model available to clients after the federated learning procedure may still leak sensitive data about specific individuals, as described earlier. Adversarial attacks on federated learning models can be mitigated by inspecting and filtering out malicious client updates [43]. However, the Secure Aggregation protocol [93] prevents the inspection of individual updates and therefore cannot defend against such poisoning attacks [42] in this way.

While SMC achieves privacy through increased computational complexity, differential privacy trades off utility for increased privacy. Additionally, differential privacy protects an individual’s contributions to the model both during training and once the model is fully trained. Differential privacy has been applied in multiple cases to mitigate the issue of publishing sensitive weight updates during communication rounds in a federated learning setting. Shokri & Shmatikov [11] describe a communication-efficient method for federated learning of a deep learning model, tested on the MNIST and SVHN datasets. They select only a fraction of the local gradient updates to share with a central server, but also experiment with adding noise to the updates to satisfy differential privacy and protect the contributions of individuals to the global model. An ε-budget is divided and spent partly on selecting gradients above a certain threshold and partly on publishing the gradients. The sensitivity of SGD is judged by bounding the gradients between [−γ, γ] (γ is set to some small number). Laplacian noise is generated using this sensitivity and added to the updates prior to selection/publishing. The authors show that their differentially private method outperforms standalone training (training performed by each client on their own data alone) and approaches the performance of SGD on a non-private static dataset, given that enough clients participate in each communication round.
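
A simplified sketch of this style of update sharing is given below; note that the budget-splitting between selection and publication used in [11] is collapsed into a single noise addition here, and γ, ε and the sharing fraction are arbitrary.

```python
# A simplified sketch of selective, noisy gradient sharing in the
# spirit of [11]: clip gradients to [-gamma, gamma], add Laplace
# noise scaled to that sensitivity (2 * gamma), and upload only the
# largest fraction of the perturbed values.
import numpy as np

rng = np.random.default_rng(0)

def private_gradient_selection(grads, gamma=0.001, epsilon=1.0, fraction=0.1):
    clipped = np.clip(grads, -gamma, gamma)
    noisy = clipped + rng.laplace(0.0, 2 * gamma / epsilon, size=grads.shape)
    k = max(1, int(fraction * grads.size))
    top_k = np.argsort(np.abs(noisy))[-k:]     # indices of the largest updates
    shared = np.zeros_like(grads)
    shared[top_k] = noisy[top_k]               # everything else stays local
    return shared

grads = rng.normal(scale=0.01, size=1000)
print(np.count_nonzero(private_gradient_selection(grads)))  # 100
```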

Abadi et al. [10] apply a differentially private SGD mechanism to train on the MNIST and CIFAR-10 image datasets. They show they can achieve 97% accuracy on MNIST (1.3% worse than the non-differentially private baseline) and 73% accuracy on CIFAR-10 (7% worse than the non-differentially private baseline) using a modest neural network and principal component analysis to reduce the dimensionality of the input space. This is achieved under an (ε, δ)-differential privacy guarantee of (8, 10⁻⁵). The authors also introduce a privacy accountant to monitor the accumulated privacy loss over all training operations, based on moments of the privacy loss random variable. The authors point out that the privacy loss is minimal for such a large number of parameters and training examples.



TABLE II
A SUMMARY OF THE MAJOR CONTRIBUTIONS TO FEDERATED LEARNING RESEARCH WITH A FOCUS ON PRIVACY ENHANCING MECHANISMS
(DP = DIFFERENTIAL PRIVACY, HE = HOMOMORPHIC ENCRYPTION, SMC = SECURE MULTI-PARTY COMPUTATION)

Ref  | Application     | Year | Major contribution                                                                                                                                  | Privacy mechanism | Privacy details
[11] | Computer vision | 2015 | Description of a selective distributed gradient descent method to reduce communication and the application of differential privacy to protect the model parameter updates | DP | Batch-level DP, ε-DP (Laplace mechanism)
[10] | Computer vision | 2016 | Description of an efficient accounting method for accumulating privacy losses while training a DNN with differential privacy                        | DP | Batch-level DP, (ε, δ)-DP (Gaussian mechanism)
[93] | N/a             | 2017 | New method to provide secure multi-party computation specifically tailored towards federated learning                                               | SMC | Secure Aggregation protocol evaluates the average gradients of clients only when a sufficient number send updates
[88] | Computer vision | 2017 | Method for providing user-level differential privacy for federated learning with only small loss in model utility                                   | DP | User-level DP, (ε, δ)-DP (Gaussian mechanism)
[92] | NLP             | 2017 | Method for providing user-level differential privacy for federated learning without degrading model utility                                         | DP | User-level DP, (ε, δ)-DP (Gaussian mechanism)
[94] | Computer vision | 2017 | Demonstration of an attack method on the global model using a generative adversarial network, effective even against record/batch-level DP          | DP | Attack tested against record/batch-level DP (implemented using [11])
[95] | Computer vision | 2017 | Method for encrypting user updates during distributed training, decryptable only when many clients have participated in the distributed learning objective | HE, SMC | Gradient updates are encrypted using homomorphic encryption; the aggregating server obtains the average gradient over all workers but can only decrypt this result once a certain number of updates have been aggregated
[96] | N/a             | 2019 | Description of a full-scale production-ready federated learning system (focusing on mobile devices)                                                 | SMC | Optionally makes use of the Secure Aggregation protocol in [93]

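A minimal sketch of one such differentially private SGD step follows, in the style of [10]: each per-example gradient is clipped to an L2 norm bound and Gaussian noise is added to the aggregate. The moments-accountant bookkeeping that converts the noise multiplier into an (ε, δ) guarantee is omitted, and all parameter values are illustrative.

```python
# A minimal sketch of one DP-SGD step: per-example L2 clipping
# followed by Gaussian noise on the summed gradient.
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))  # per-example clipping
    batch = len(clipped)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / batch
    return params - lr * noisy_mean

params = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(32)]   # one gradient per example
params = dp_sgd_step(params, grads)
```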

Geyer et al. [88] make use of the moments privacy accountant from [10] and evaluate the accumulated δ during training. Once the accumulated δ reaches a given threshold, training is halted. Intuitively, training is halted once the risk of the privacy guarantee being broken becomes too probable. This method of federated learning protects the privacy of an individual’s participation in training over their entire local dataset, as opposed to a single data point during training as in [10]. The authors show that, with a sufficiently large number of clients participating in the federated optimisation, only a minor drop in performance is recorded whilst maintaining a good level of privacy over the individual’s data. Similarly, McMahan et al. [92] apply user-level differential privacy (noise is added using sensitivity measured at the user level rather than the example or mini-batch level) via the moments privacy accountant introduced in [10].

A method for attacking deep learning models trained via federated learning has been proposed in [94]. This approach involves a malicious user participating in federated training whose alternative objective is to train a generative adversarial network (GAN) to generate realistic examples from the globally shared model during training. The authors show that even a model trained with differentially private updates is susceptible to the attack, but that it could be defended against with user-level or device-level differential privacy such as that described in [88] and [92].

An alternative method to perform machine learning on private data is via a knowledge-distillation-like approach. Private Aggregation of Teacher Ensembles (PATE) [97] trains a student model (which is published and used for inference) using many teacher models in an ensemble. Neither the sensitive data available to the teacher models, nor the teacher models themselves, are ever published. The teacher models, once trained on the sensitive data, are used to label public data in a semi-supervised fashion via voting for the predicted class. The votes cast by the teachers have noise drawn from a Laplacian distribution added to preserve the privacy of their predictions. This approach requires that public data is available to train the student model, but shows better performance than [10] and [11] whilst maintaining privacy guarantees ((ε, δ)-differential privacy of (2.04, 10⁻⁵) and (8.19, 10⁻⁶) on the MNIST and SVHN datasets respectively). Further improvements to PATE show that the method can scale to large multi-class problems [98].
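
The noisy vote aggregation at the heart of PATE can be sketched as follows; the teacher votes, class count and noise scale are all illustrative.

```python
# A minimal sketch of PATE-style noisy vote aggregation [97]:
# Laplace noise is added to the teachers' vote histogram and the
# noisy winner becomes the label used to train the student.
import numpy as np

rng = np.random.default_rng(0)

def noisy_aggregate(teacher_votes, n_classes, epsilon=2.0):
    counts = np.bincount(teacher_votes, minlength=n_classes).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon, size=n_classes)
    return int(np.argmax(counts))

votes = np.array([3, 3, 3, 3, 7, 3, 3, 1, 3, 3])  # 10 teachers, 10 classes
print(noisy_aggregate(votes, n_classes=10))        # almost always class 3
```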

VI. CHALLENGES AND FUTURE DIRECTIONS

In this section, we identify and outline some promising areas for developing federated learning research.

1) Optimal model architecture/hyperparameters: Federated learning precludes seeing the data on which a model is trained. With a traditional centralised dataset, a deep learning architecture and its hyperparameters can be selected via a validation strategy. Following the same approach in federated learning to find an optimal architecture or to set optimal hyperparameters would require training many models on user devices (possibly incurring unacceptable amounts of battery power and bandwidth). Therefore, more research is required to tackle this problem, which is unique to federated learning.

2) Continual learning: Training a machine learning model is an expensive and time-consuming task, and this can be significantly worse in the federated learning setting. As data distributions evolve over time, a trained model’s performance deteriorates. To avoid the cost of federated training many times over, research into methods for improving how a model learns over time is congruent with the federated learning objective. Methods such as meta-learning, online learning and continual learning will be important here, each facing specific challenges arising from the distributed nature of federated learning.

3) Better privacy preserving methods: As seen in this review, there is an observable tradeoff between the performance of a model and the privacy afforded to a user. Research is ongoing into differential privacy accounting methods that introduce less noise into the model (thus improving utility) for the same level of privacy (as judged by the ε parameter). Likewise, further research is required to vastly reduce the computational burden of methods such as homomorphic encryption and secure multi-party computation, in order for them to become commonly used methods for preserving privacy in large-scale machine learning tasks.

4) Federated learning combined with fog computing: Reducing the latency between rounds of training in federated learning is desirable for training models quickly. Fog computing nodes could feasibly be leveraged as aggregating servers to remove the round-trip communication between clients and cloud servers in the averaging step of federated learning. Fog computing could also bring other benefits, such as sharing the computational burden by hierarchically averaging many large client models.

5) Greater variety of machine learning problems: Current federated learning research understandably focusses on CNNs and RNNs applied to benchmark datasets, in order to simply understand the behaviour of federated learning and to easily compare results from different research groups. However, federated learning also needs to be characterised more thoroughly on real-world datasets and for a variety of different machine learning tasks. Additionally, different deep learning networks such as autoencoders and generative adversarial networks have not yet been trained in a federated setting in the literature.

6) Federated learning on low power devices: Training deep networks on resource-constrained and low-power devices poses specific challenges for federated learning. Much of the research into federated learning focusses on mobile devices such as smartphones, with relatively abundant compute, storage and power capabilities. As such, new methods are required for reducing the amount of work individual devices need to do to contribute to training (perhaps using the model parallelism approach seen in [24], or training only certain deep network layers on subsets of devices).

VII. CONCLUSION

Deep learning has shown impressive successes in the fields of computer vision, speech recognition and language modelling. With the exploding increase in deployments of IoT devices, deep learning is naturally starting to be applied at the edge of the network, on mobile and resource-limited embedded devices. This environment, however, presents difficult challenges for training deep models due to their energy, compute and memory requirements. Beyond this, a model’s utility is strictly limited by the data available to the edge device. Allowing machines close to the edge of the network to train on data produced by edge devices (as in fog computing) risks privacy breaches of such data. Federated learning has been shown to be a good solution to improve deep learning models while maintaining the privacy of the raw data.

Federated learning is a very new field of research but has great potential for improving the privacy of training data and giving users control over how their data is used by third parties. Combining federated learning with privacy mechanisms such as differential privacy further secures user data from adversaries with the inclination and means to reverse-engineer parameter updates in distributed SGD procedures. Differential privacy as applied to machine learning is also in its infancy, and challenges remain in providing good privacy guarantees whilst simultaneously limiting the required communication costs in a federated setting.

The intersection of federated learning, differential privacy and IoT data represents an open area of research. Performing deep learning efficiently on resource-constrained devices while preserving privacy and utility poses a real challenge. Additionally, the nature of IoT data, as opposed to internet data, for private federated learning has not yet been tested sufficiently. Much of the literature around federated learning focuses on testing methods on benchmark datasets such as MNIST and CIFAR-10. The IoT, on the other hand, is often represented by highly skewed non-IID data with high temporal variability. This is a challenge that needs to be overcome for federated learning to flourish in edge environments.

ACKNOWLEDGMENTS

This work is partly supported by the SEND project (grant ref. 32R16P00706) funded by ERDF and BEIS. We would also like to thank Assured Systems Ltd (UK) as an industry partner in conducting this review.

REFERENCES

[1] Gartner. (2017, Feb.) Gartner Says 8.4 Billion Connected “Things” Will Be in Use in 2017, Up 31 Percent From 2016. [Online]. Available: https://www.gartner.com/en/newsroom/press-releases/2017-02-07-gartner-says-8-billion-connected-things-will-be-in-use-in-2017-up-31-percent-from-2016

[2] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet of things,” in SIGCOMM 2012 MCC Workshop. New York, New York, USA: ACM, Aug. 2012, pp. 13–16.

[3] Y. Ai, M. Peng, and K. Zhang, “Edge computing technologies for Internet of Things: a primer,” Digital Communications and Networks, vol. 4, no. 2, pp. 77–86, Apr. 2018.

[4] L. Bittencourt, R. Immich, R. Sakellariou, N. Fonseca, E. Madeira, M. Curado, L. Villas, L. DaSilva, C. Lee, and O. Rana, “The Internet of Things, Fog and Cloud continuum: Integration and challenges,” Internet of Things, vol. 3–4, pp. 134–155, Oct. 2018.

[5] OpenFog Consortium. (2017, Feb.) OpenFog Reference Architecture for Fog Computing. [Online]. Available: https://www.openfogconsortium.org/wp-content/uploads/OpenFog_Reference_Architecture_2_09_17-FINAL.pdf

[6] F. Bonomi, R. Milito, P. Natarajan, and J. Zhu, “Fog Computing: A Platform for Internet of Things and Analytics,” in Big Data and Internet of Things: A Roadmap for Smart Environments. Cham: Springer, 2014, pp. 169–186.

[7] H. Li, K. Ota, and M. Dong, “Learning IoT in Edge: Deep Learning for the Internet of Things with Edge Computing,” IEEE Network, vol. 32, no. 1, pp. 96–101, 2018.

[8] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” arXiv.org, p. arXiv:1802.09941, Feb. 2018.

[9] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.

[10] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep Learning with Differential Privacy,” in the 2016 ACM SIGSAC Conference. ACM, Oct. 2016, pp. 308–318.

[11] R. Shokri and V. Shmatikov, “Privacy-Preserving Deep Learning,” in the 22nd ACM SIGSAC Conference. ACM, Oct. 2015, pp. 1310–1321.

[12] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive Federated Learning in Resource Constrained Edge Computing Systems,” arXiv.org, p. arXiv:1804.05271, Apr. 2018.

[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 1st ed. Cambridge, Massachusetts: MIT Press, 2016.

[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, Massachusetts: MIT Press, Jan. 1986, pp. 318–362.

[15] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv.org, p. arXiv:1609.04747, Sep. 2016.

[16] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, Jan. 1999.

[17] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, Feb. 2011.

[18] G. E. Hinton, Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera, 2012.

[19] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv.org, p. arXiv:1412.6980, Dec. 2014.

[20] Y. Lecun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7, pp. 436–444, May 2015.

[21] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 1724–1734.

[22] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, Jun. 2017.

[24] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in neural information processing systems. Curran Associates Inc., Dec. 2012, pp. 1223–1231.

[25] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project Adam: Building an Efficient and Scalable Deep Learning Training System,” in OSDI, 2014, pp. 571–582.

[26] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, and A. Ahmed, “Scaling Distributed Machine Learning with the Parameter Server,” OSDI, vol. 14, pp. 583–598, 2014.

[27] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent,” in Advances in neural information processing systems, 2011, pp. 693–701.

[28] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server,” in Advances in neural information processing systems, 2013, pp. 1223–1231.

[29] A. Odena, “Faster Asynchronous SGD,” arXiv.org, p. arXiv:1601.04033, Jan. 2016.

[30] W. Zhang, S. Gupta, X. Lian, and J. Liu, “Staleness-aware Async-SGD for Distributed Deep Learning,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, Nov. 2015, pp. 2350–2356.

[31] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, and P. B. Gibbons, “Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds,” in NSDI, 2017, pp. 629–647.

[32] J. Jiang, B. Cui, C. Zhang, and L. Yu, “Heterogeneity-aware Distributed Parameter Servers,” in the 2017 ACM International Conference. New York, New York, USA: ACM Press, 2017, pp. 463–478.

[33] J. Daily, A. Vishnu, C. Siegel, T. Warfel, and V. Amatya, “GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent,” arXiv.org, p. arXiv:1803.05880, Mar. 2018.

[34] P. H. Jin, Q. Yuan, F. Iandola, and K. Keutzer, “How to scale distributed deep learning?” arXiv.org, p. arXiv:1611.04581, Nov. 2016.

[35] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli, “Asynchronous gossip algorithms for stochastic optimization,” in 2009 International Conference on Game Theory for Networks (GameNets). IEEE, 2009, pp. 80–81.

[36] J. Ba and R. Caruana, “Do Deep Nets Really Need to be Deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.

[37] Y. Chebotar and A. Waters, “Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition,” in Interspeech 2016. ISCA, Sep. 2016, pp. 3439–3443.

[38] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv.org, p. arXiv:1503.02531, Mar. 2015.

[39] J. Konecny, H. B. McMahan, D. Ramage, and P. Richtarik, “Federated Optimization: Distributed Machine Learning for On-Device Intelligence,” arXiv.org, p. arXiv:1610.02527, Oct. 2016.

[40] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated Learning: Strategies for Improving Communication Efficiency,” arXiv.org, p. arXiv:1610.05492, Oct. 2016.

[41] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated Multi-Task Learning,” in Advances in neural information processing systems, May 2017, p. arXiv:1705.10467.

[42] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How To Backdoor Federated Learning,” arXiv.org, p. arXiv:1807.00459, Jul. 2018.

[43] C. Fung, C. J. M. Yoon, and I. Beschastnikh, “Mitigating Sybils in Federated Learning Poisoning,” arXiv.org, p. arXiv:1808.04866, Aug. 2018.

[44] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, “Federated Learning for Mobile Keyboard Prediction,” arXiv.org, Nov. 2018.

[45] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau, “Federated Learning for Keyword Spotting,” arXiv.org, p. arXiv:1810.05512, Oct. 2018.

[46] D. Liu, T. Miller, R. Sayeed, and K. D. Mandl, “FADL: Federated-Autonomous Deep Learning for Distributed Electronic Health Record,” arXiv.org, p. arXiv:1811.11400, Nov. 2018.


[47] T. Nishio and R. Yonetani, “Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge,” arXiv.org, p. arXiv:1804.08333, Apr. 2018.

[48] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated Learning with Non-IID Data,” arXiv.org, p. arXiv:1806.00582, Jun. 2018.

[49] Y. Liang, M. F. Balcan, and V. Kanchanapally, “Distributed PCA and k-means clustering,” in The Big Learning Workshop at NIPS, 2013.

[50] R. Johnson and T. Zhang, “Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction,” in Proceedings of the 26th International Conference on Neural Information Processing Systems – Volume 1. USA: Curran Associates Inc., 2013, pp. 315–323.

[51] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, “Deep learning for healthcare: review, opportunities and challenges,” Briefings in Bioinformatics, vol. 19, no. 6, pp. 1236–1246, May 2017.

[52] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” arXiv.org, p. arXiv:1510.00149, Oct. 2015.

[53] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both Weights and Connections for Efficient Neural Network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.

[54] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” arXiv.org, p. arXiv:1502.02551, Feb. 2015.

[55] T. Dettmers, “8-Bit Approximations for Parallelism in Deep Learning,” arXiv.org, p. arXiv:1511.04561, Nov. 2015.

[56] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[57] C. Hardy, E. Le Merrer, and B. Sericola, “Distributed deep learning on edge-devices: Feasibility via adaptive compression,” in 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA). IEEE, pp. 1–8.

[58] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” arXiv.org, p. arXiv:1712.01887, Dec. 2017.

[59] A. Gandomi and M. Haider, “Beyond the hype: Big data concepts, methods, and analytics,” International Journal of Information Management, vol. 35, no. 2, pp. 137–144, Apr. 2015.

[60] UN General Assembly, “Universal Declaration of Human Rights,” Oct. 2015.

[61] ——, “International Covenant on Civil and Political Rights,” Dec. 1966.

[62] Council of Europe, “European convention for the protection of human rights and fundamental freedoms,” Nov. 1950.

[63] European Commission, “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation),” Official Journal of the European Union, vol. L119, pp. 1–88, May 2016.

[64] C. Dwork, “Differential Privacy,” in Automata, Languages and Programming. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 1–12.

[65] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM Computing Surveys (CSUR), vol. 42, no. 4, pp. 14–53, Jun. 2010.

[66] H. T. Greely, “The uneasy ethical and legal underpinnings of large-scale genomic biobanks,” Annual Review of Genomics and Human Genetics, vol. 8, no. 1, pp. 343–364, 2007.

[67] A. Narayanan and V. Shmatikov, “Robust De-anonymization of Large Sparse Datasets,” in 2008 IEEE Symposium on Security and Privacy (SP 2008). IEEE, 2008, pp. 111–125.

[68] A. Tockar. (2014, Sep.) Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset. [Online]. Available: https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

[69] L. Sweeney, “K-anonymity: A Model for Protecting Privacy,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.

[70] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “L-diversity: privacy beyond k-anonymity,” in 22nd International Conference on Data Engineering. IEEE, 2006, pp. 24–24.

[71] N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity,” in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007, pp. 106–115.

[72] C. Gentry, “Computing arbitrary functions of encrypted data,” Communications of the ACM, vol. 53, no. 3, pp. 97–105, Mar. 2010.

[73] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, “A Survey on Homomorphic Encryption Schemes: Theory and Implementation,” ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 79–35, Sep. 2018.

[74] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy,” in International Conference on Machine Learning, 2016, pp. 201–210.

[75] E. Hesamifard, H. Takabi, and M. Ghasemi, “CryptoDL: Deep Neural Networks over Encrypted Data,” arXiv.org, p. arXiv:1711.05189, Nov. 2017.

[76] L. Rist. (2018, Jan.) Encrypt your Machine Learning. [Online]. Available: https://medium.com/corti-ai/encrypt-your-machine-learning-12b113c879d6

[77] Y. Du, L. Gustafson, D. Huang, and K. Peterson, Implementing ML Algorithms with HE. MIT Course 6.857: Computer and Network Security, 2017.

[78] E. Chou, J. Beal, D. Levy, S. Yeung, A. Haque, and L. Fei-Fei, “Faster CryptoNets: Leveraging Sparsity for Real-World Encrypted Inference,” arXiv.org, p. arXiv:1811.09953, Nov. 2018.

[79] O. Goldreich, “Secure multi-party computation,” Manuscript. Preliminary version, vol. 78, 1998.

[80] A. Shamir, “How to share a secret,” Communications of the ACM, vol. 22, no. 11, pp. 612–613, Nov. 1979.

[81] J. Launchbury, D. Archer, T. DuBuisson, and E. Mertens, “Application-Scale Secure Multiparty Computation,” in Programming Languages and Systems. Berlin, Heidelberg: Springer, Apr. 2014, pp. 8–26.

[82] C. Dwork and A. Roth, “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, Aug. 2014.

[83] Q. Geng, P. Kairouz, S. Oh, and P. Viswanath, “The Staircase Mechanism in Differential Privacy,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 7, pp. 1176–1184, Oct. 2015.

[84] S. P. Kasiviswanathan and A. Smith, “On the ‘semantics’ of differential privacy: A Bayesian formulation,” Journal of Privacy and Confidentiality, vol. 6, no. 1, 2014.

[85] T. Zhu, G. Li, W. Zhou, and P. S. Yu, “Preliminary of Differential Privacy,” in Differential Privacy and Applications. Cham: Springer International Publishing, 2017, pp. 7–16.

[86] J. Lee and C. Clifton, “How Much Is Enough? Choosing ε for Differential Privacy,” in Information Security. Berlin, Heidelberg: Springer, Oct. 2011, pp. 325–340.

[87] S. L. Warner, “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias,” Journal of the American Statistical Association, vol. 60, no. 309, p. 63, Mar. 1965.

[88] R. C. Geyer, T. Klein, and M. Nabi, “Differentially Private Federated Learning: A Client Level Perspective,” arXiv.org, p. arXiv:1712.07557, Dec. 2017.

[89] Apple. (2017, Nov.) Apple Differential Privacy Technical Overview. [Online]. Available: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

[90] A. G. Thakurta, “Differential Privacy: From Theory to Deployment,” in USENIX Association. Vancouver, BC: USENIX Association, 2017.

[91] U. Erlingsson, V. Pihur, and A. Korolova, “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, Nov. 2014, pp. 1054–1067.

[92] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learning Differentially Private Recurrent Language Models,” arXiv.org, p. arXiv:1710.06963, Oct. 2017.

[93] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical Secure Aggregation for Privacy-Preserving Machine Learning,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, Oct. 2017, pp. 1175–1191.

[94] B. Hitaj, G. Ateniese, and F. Perez-Cruz, “Deep Models Under the GAN,” in the 2017 ACM SIGSAC Conference. New York, New York, USA: ACM Press, 2017, pp. 603–618.


[95] X. Zhang, S. Ji, H. Wang, and T. Wang, “Private, Yet Practical, Multiparty Deep Learning,” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 1442–1452.

[96] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, and J. Roselander, “Towards Federated Learning at Scale: System Design,” arXiv.org, p. arXiv:1902.01046, Feb. 2019.

[97] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data,” arXiv.org, p. arXiv:1610.05755, Oct. 2016.

[98] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and U. Erlingsson, “Scalable Private Learning with PATE,” arXiv.org, p. arXiv:1802.08908, Feb. 2018.