
arXiv:2007.11115v2 [cs.CR] 20 Feb 2021

Byzantine-Resilient Secure Federated Learning
Jinhyun So, Başak Güler, A. Salman Avestimehr

Abstract—Secure federated learning is a privacy-preserving framework to improve machine learning models by training over large volumes of data collected by mobile users. This is achieved through an iterative process where, at each iteration, users update a global model using their local datasets. Each user then masks its local update via random keys, and the masked models are aggregated at a central server to compute the global model for the next iteration. As the local updates are protected by random masks, the server cannot observe their true values. This presents a major challenge for the resilience of the model against adversarial (Byzantine) users, who can manipulate the global model by modifying their local updates or datasets. Towards addressing this challenge, this paper presents the first single-server Byzantine-resilient secure aggregation framework (BREA) for secure federated learning. BREA is based on an integrated stochastic quantization, verifiable outlier detection, and secure model aggregation approach to guarantee Byzantine-resilience, privacy, and convergence simultaneously. We provide theoretical convergence and privacy guarantees and characterize the fundamental trade-offs in terms of the network size, user dropouts, and privacy protection. Our experiments demonstrate convergence in the presence of Byzantine users, and comparable accuracy to conventional federated learning benchmarks.

Index Terms—Federated learning, privacy-preserving machine learning, Byzantine-resilience, distributed training in mobile networks.

    I. INTRODUCTION

Federated learning is a distributed training framework that has received significant interest in recent years, by allowing machine learning models to be trained over the vast amount of data collected by mobile devices [2], [3]. In this framework, training is coordinated by a central server who maintains a global model, which is updated by the mobile users through an iterative process. At each iteration, the server sends the current version of the global model to the mobile devices, who update it using their local data and create a local update. The server then aggregates the local updates of the users and updates the global model for the next iteration [2]–[9].

Security and privacy considerations of distributed learning are mainly focused around two seemingly separate directions: 1) ensuring robustness of the global model against adversarial manipulations and 2) protecting the privacy of individual users. The first direction aims at ensuring that the trained model is robust against Byzantine faults that may occur in the training data or during protocol execution. These faults

Jinhyun So is with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, 90089 USA (e-mail: [email protected]). Başak Güler is with the Department of Electrical and Computer Engineering, University of California, Riverside, CA, 92521 USA (email: [email protected]). A. Salman Avestimehr is with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, 90089 USA (e-mail: [email protected]).

This work is published in the IEEE Journal on Selected Areas in Communications [1].

may result either from an adversarial user who can manipulate the training data or the information exchanged during the protocol, or due to device malfunctioning. Notably, it has been shown that even a single Byzantine fault can significantly alter the trained model [10]. The primary approach for defending against Byzantine faults is by comparing the local updates received from different users and removing the outliers at the server [10]–[14]. Doing so, however, requires the server to learn the true values of the local updates of each individual user. The second direction aims at protecting the privacy of the individual users, by keeping each local update private from the server and the other users participating in the protocol [3]–[8]. This is achieved through what is known as a secure aggregation protocol [3]. In this protocol, each user masks its local update through additive secret sharing using private and pairwise random keys before sending it to the server. Once the masked models are aggregated at the server, the additional randomness cancels out and the server learns the aggregate of all user models. At the end of the protocol, the server learns no information about the individual models beyond the aggregated model, as they are masked by the random keys unknown to the server. In contrast, conventional distributed training frameworks that perform gradient aggregation and model updates using the true values of the gradients may reveal extensive information about the local datasets of the users, as shown in [15]–[17].

This presents a major challenge in developing a Byzantine-resilient, and at the same time, privacy-preserving federated learning framework. On the one hand, robustness against Byzantine faults requires the server to obtain the individual model updates in the clear, to be able to compare the updates from different users with each other and remove the outliers. On the other hand, protecting user privacy requires each individual model to be masked with random keys; as a result, the server only observes the masked model, which appears as a uniformly random vector that could correspond to any point in the parameter space. Our goal is to reconcile these two critical directions. In particular, we want to address the following question: “How can one make federated learning protocols robust against Byzantine adversaries while preserving the privacy of individual users?”

In this paper, we propose the first single-server Byzantine-resilient secure aggregation framework, BREA, towards addressing this problem. Our framework is built on the following main principles. Given a network of N mobile users with up to A adversaries, each user initially secret shares its local update with the other users, through a verifiable secret sharing protocol [18]. However, doing so requires the local updates to be masked by uniformly random vectors in a finite field [19], whereas the model updates during training are performed in the domain of real numbers. In order to handle this problem,



BREA utilizes stochastic quantization to transfer the local updates from the real domain into a finite field.

Verifiable secret sharing allows the users to perform consistency checks to validate the secret shares and ensure that every user follows the protocol. However, a malicious user can still manipulate the global model by modifying its local update or private dataset. BREA handles such attacks through a robust gradient descent approach, enabled by secure computations over the secret shares of the local updates. To do so, each user locally computes the pairwise distances between the secret shares of the local updates belonging to other users, and sends the computation results to the server. Since these computations are carried out using the secret shares, users do not learn the true values of the local updates belonging to other users.

In the final phase, the server collects the computation results from a sufficient number of users, recovers the pairwise distances between the local updates, and performs user selection for model aggregation. The user selection protocol is based on a distance-based outlier removal mechanism [10], to remove the effect of potential adversaries and to ensure that the selected models are sufficiently close to an unbiased gradient estimator. After the user selection phase, the secret shares of the models belonging to the selected users are aggregated locally by the mobile users. The server then gathers the secure computation results from the users, reconstructs the true value of the aggregate of the selected user models, and updates the global model. Our framework guarantees the privacy of individual user models; in particular, the server learns no information about the local updates beyond their aggregated value and the pairwise distances.

In our theoretical analysis, we demonstrate provable convergence guarantees for the model and robustness guarantees against Byzantine adversaries. We then identify the theoretical performance limits in terms of the fundamental trade-offs between the network size, user dropouts, number of adversaries, and privacy protection. Our results demonstrate that, in a network with N mobile users, BREA can theoretically guarantee: i) robustness of the trained model against up to A Byzantine adversaries, ii) tolerance against up to D user dropouts, iii) privacy of each local update, against the server and up to T colluding users, as long as N ≥ 2A + 1 + max{m + 2, D + 2T}, where m is the number of selected models for aggregation.

We then numerically evaluate the performance of BREA and compare it to the conventional federated learning protocol, the federated averaging scheme of [2]. To do so, we implement BREA in a distributed network of N = 40 users with up to A = 12 Byzantine users who can send arbitrary vectors to the server or to the honest users. We demonstrate that BREA guarantees convergence against Byzantine users and that its convergence rate is comparable to the convergence rate of federated averaging. BREA also achieves test accuracy comparable to the federated averaging scheme, while incurring a quantization loss in order to preserve the privacy of individual users.

Finally, while BREA provides the first distributed training framework that is both Byzantine-resilient and privacy-preserving, there are several directions for further improvements. The first is that our theoretical analysis for convergence needs the assumption of an independent and identically distributed (i.i.d.) data distribution over the users. Second, we rely on distance-based outlier removal mechanisms, which make it difficult to distinguish whether a large distance between local updates is due to a non-i.i.d. data distribution or a Byzantine attack. Providing theoretical performance guarantees for non-i.i.d. data in Byzantine-robust training setups is an important and interesting direction for future research. However, one should also note that, even without the privacy requirements, this is still an open problem and an active area of research in the literature [20], [21]. Other open questions include whether it is possible to break the quadratic communication and computation complexity of Byzantine-resilient secure aggregation in the current work, and investigating the fundamental performance limits of Byzantine-resilient distributed learning with multiple local updates.

    II. RELATED WORK

In the non-Byzantine federated learning setting, secure aggregation is performed through a procedure known as additive masking [3], [22]. In this setup, users first agree on pairwise secret keys using a Diffie-Hellman type key exchange protocol [23], and send a masked version of their local update to the server, where the masking is done using pairwise and private secret keys. When the masked models are aggregated at the server, additive masks cancel out, allowing the server to learn the aggregate of the local updates. This process works well if no users drop during the execution of the protocol. In wireless environments, however, users may drop from the protocol at any time due to the variations in channel conditions. Such user dropouts are handled by letting each user secret share their private and pairwise keys through Shamir's secret sharing [19]. The server can remove the additive masks by collecting the secret shares from the surviving users. This approach leads to a quadratic communication overhead in the number of users. More recent approaches have focused on reducing the communication overhead, by training in a smaller parameter space [24], autotuning the parameters [25], or by utilizing coding techniques [7].

Another line of work has focused on differentially-private federated learning approaches [26], [27], to protect the privacy of personally-identifiable information against inference attacks. Although our focus is not on differential privacy, our approach may in principle be combined with differential privacy techniques [28], which is an interesting future direction. Another important direction in federated learning is the study of fairness and how to avoid biasing the model towards specific users [29], [30]. The convergence properties of federated learning models are investigated in [31].

Distributed training protocols have been extensively studied in the Byzantine setting using clear (unmasked) model updates [10]–[14], [20], [21]. The main defense mechanism to protect the trained model against Byzantine users is by comparing the model updates received from different users, and removing the outliers. Doing so ensures that the selected model updates are close to each other, as long as the network has a sufficiently large number of honest users. A related line of work is model poisoning attacks, which are studied in [32], [33].


Fig. 1. Secure aggregation in federated learning. At iteration t, the server sends the current state of the global model, denoted by w^(t), to the mobile users. User i ∈ [N] forms a local update w_i^(t) by updating the global model using its local dataset. The local updates are aggregated in a privacy-preserving protocol at the server, who then updates the global model, and sends the new model, w^(t+1), to the mobile users.

In concurrent work, a Byzantine-robust secure gradient descent algorithm has been proposed for a two-server model in [34]. However, unlike federated learning (which is based on a single-server architecture) [2], [3], this work requires two honest (non-colluding) servers who both interact with the mobile users and communicate with each other to carry out a secure two-party protocol, but do not share any sensitive information with each other in an attempt to breach user privacy. In contrast, our goal is to develop a single-server Byzantine-resilient secure training framework, to facilitate robust and privacy-preserving training architectures for federated learning. Compared to two-server models, single-server models carry the additional challenge that all information has to be collected at a single server, while still keeping the individual models of the users private.

The remainder of the paper is organized as follows. In Section III, we provide background on federated learning. Section IV presents our system model along with the key parameters used to evaluate the system performance. Section V introduces our framework and the details of the specific components. Section VI presents our theoretical results, whereas our numerical evaluations are provided in Section VII, to demonstrate the convergence and Byzantine-resilience. The paper is concluded in Section VIII. The following notation is used throughout the paper. We represent a scalar variable with x, whereas boldface x represents a vector. A set is denoted by X, whereas [N] refers to the set {1, . . . , N}.

    III. BACKGROUND

Federated learning is a distributed training framework for machine learning in mobile networks while preserving the privacy of users. Training is coordinated by a central server who maintains a global model w ∈ R^d with dimension d. The goal is to train the global model using the data held at the mobile devices, by minimizing a global objective function F(w) as,

min_w F(w).    (1)

The global model is updated locally by mobile users on sensitive private datasets, by letting

F(w) = ∑_{i=1}^{N} (D_i / D) F_i(w)    (2)

where N is the total number of mobile users, F_i(w) denotes the local objective function of user i, D_i is the number of data points in user i's private dataset 𝒟_i, and D := ∑_{i=1}^{N} D_i. For simplicity, we assume that users have equal-sized datasets, i.e., D_i = D/N for all i ∈ [N].

Training is performed through an iterative process where mobile users interact through the central server to update the global model. At each iteration, the server shares the current state of the global model, denoted by w^(t), with the mobile users. Each user i creates a local update,

w_i^(t) = g(w^(t), ξ_i^(t))    (3)

where g is an estimate of the gradient ∇F(w^(t)) of the cost function F and ξ_i^(t) is a random variable representing the random sample (or a mini-batch of samples) drawn from 𝒟_i. We assume that the private datasets {𝒟_i}_{i∈[N]} have the same distribution and that {ξ_i^(t)}_{i∈[N]} are i.i.d., ξ_i^(t) ∼ ξ, where ξ is a uniform random variable such that each w_i^(t) is an unbiased estimator of the true gradient ∇F(w^(t)), i.e.,

E_ξ[g(w^(t), ξ_i^(t))] = ∇F(w^(t)).    (4)

The local updates are aggregated at the server in a privacy-preserving protocol, such that the server only learns the aggregate of a large fraction of the local updates, ideally the sum of all user models ∑_{i∈[N]} w_i^(t), but no further information is revealed about the individual models beyond their aggregated value. Using the aggregate of the local updates, the server updates the global model for the next iteration,

w^(t+1) = w^(t) − γ^(t) ∑_{i∈[N]} w_i^(t)    (5)

where γ^(t) is the learning rate, and sends the updated model w^(t+1) to the users. This process is illustrated in Figure 1.

Conventional secure aggregation protocols require each user to mask its local update using random keys before aggregation [3], [7], [35]. This is typically done by creating pairwise keys between the users through a key exchange protocol [23]. Using the pairwise keys, each pair of users i, j ∈ [N] agree on a pairwise random seed a_ij^(t). User i also creates a private random seed b_i^(t), which protects the privacy of the local update in case the user is delayed instead of being dropped, in which case the pairwise keys are not sufficient for privacy, as shown in [3]. User i ∈ [N] then sends a masked version of its local update w_i^(t), given by

y_i^(t) := w_i^(t) + PRG(b_i^(t)) + ∑_{j: i<j} PRG(a_ij^(t)) − ∑_{j: i>j} PRG(a_ji^(t))    (6)

to the server, where PRG is a pseudo random generator. User i then secret shares b_i^(t) and {a_ij^(t)}_{j∈[N]} with the other users, via Shamir's secret sharing [19]. For computing the aggregate of the user models, the server collects either the secret shares of


the pairwise seeds belonging to a dropped user, or the shares of the private seed belonging to a surviving user (but not both). The server then recovers the private seeds of the surviving users and the pairwise seeds of the dropped users, and removes them from the aggregate of the masked models,

y^(t) = ∑_{i∈U} (y_i^(t) − PRG(b_i^(t))) − ∑_{i∈D} (∑_{j: i<j} PRG(a_ij^(t)) − ∑_{j: i>j} PRG(a_ji^(t))) = ∑_{i∈U} w_i^(t)    (7)

and obtains the aggregate of the local updates, where U ⊆ [N] and D ⊆ [N] denote the set of surviving and dropped users, respectively. In (7), ∑_{i∈U} PRG(b_i^(t)) corresponds to the reconstructed private seeds belonging to the surviving users. On the other hand, ∑_{i∈D} (∑_{j: i<j} PRG(a_ij^(t)) − ∑_{j: i>j} PRG(a_ji^(t))) corresponds to the reconstructed pairwise seeds belonging to the dropped users. Both of these terms are reconstructed by the server to remove the random masks in the aggregate of the masked versions of the surviving users, ∑_{i∈U} y_i^(t). At the end, all of the random masks cancel out, and the server recovers the summation of the original models belonging to all of the surviving users, i.e., ∑_{i∈U} w_i^(t).

    IV. PROBLEM FORMULATION

In this section, we describe the Byzantine-resilient secure aggregation problem, by extending the conventional secure aggregation scenario from Section III to the case when some users, known as Byzantine adversaries, can manipulate the trained model by modifying their local datasets or by sharing false information during the protocol.

We consider a distributed network with N mobile users and a single server. User i ∈ [N] holds a local update¹ w_i of dimension d. The goal is to aggregate the local updates at the server, while protecting the privacy of individual users. However, unlike the non-Byzantine setting of Section III, the aggregation operation in the Byzantine setting should be robust against potentially malicious users. To this end, we represent the aggregation operation by a function,

f(w_1, . . . , w_N) = ∑_{i∈S} w_i    (8)

where S is a set of users selected by the server for aggregation. The role of S is to remove the effect of potentially Byzantine adversaries on the trained model, by removing the outliers. Similar to prior works on federated learning, our focus is on computationally-bounded parties, whose strategies can be described by a probabilistic polynomial time algorithm [3].

We evaluate the performance of a Byzantine-resilient secure aggregation protocol according to the following key parameters:

• Robustness against Byzantine users: We assume that up to A users are Byzantine (malicious), who manipulate the protocol by modifying their local datasets or by sharing false information during protocol execution. The protocol should be robust against such Byzantine adversaries.

• Privacy of local updates: The aggregation protocol should protect the privacy of any individual user from the server and any collusions between up to T users. Specifically, the local update of any user should not be revealed to the server or the remaining users, even if up to T users cooperate with each other by sharing information.²

• Tolerance to user dropouts: Due to potentially poor wireless channel conditions, we assume that up to D users may get dropped or delayed at any time during protocol execution. The protocol should be able to tolerate such dropouts, i.e., the privacy and convergence guarantees should hold even if up to D users drop or get delayed.

¹For notational clarity, throughout Sections IV and V, we omit the iteration number (t) from w_i^(t).

In this paper, we present a single-server Byzantine-resilient secure aggregation framework (BREA) for the computation of (8). BREA consists of the following key components:

1) Stochastic quantization: Users initially quantize their local updates from the real domain to the domain of integers, and then embed them in a field F_p of integers modulo a prime p. To do so, our framework utilizes stochastic quantization, which is instrumental in our theoretical convergence guarantees.

2) Verifiable secret sharing of the user models: Users then secret share their quantized models using a verifiable secret sharing protocol. This ensures that the secret shares created by the mobile users are valid, i.e., Byzantine users cannot cheat by sending invalid secret shares.

3) Secure distance computation: In this phase, users compute the pairwise distances between the secret shares of the local updates, and send the results to the server. Since this computation is performed using the secret shares of the models instead of their true values, users do not learn any information about the actual model parameters.

4) User selection at the server: Upon receiving the computation results from the users, the server recovers the pairwise distances between the local updates and selects the set of users whose models will be included in the aggregation, by removing the outliers. This ensures that the aggregated model is robust against potential manipulations from Byzantine users. The server then announces the list of the selected users.

5) Secure model aggregation: In the final phase, each user locally aggregates the secret shares of the models selected by the server, and sends the computation result to the server. Using the computation results, the server recovers the aggregate of the models of the selected users, and updates the model.

    In the following, we describe the details of each phase.

    V. THE BREA FRAMEWORK

In this section, we present the details of the BREA framework for Byzantine-resilient secure federated learning.

²Collusions that may occur between the server and the users are beyond the scope of our paper.


    A. Stochastic Quantization

In BREA, the operations for verifiable secret sharing and secure distance computations are carried out over a finite field F_p for some large prime p. To this end, user i ∈ [N] initially quantizes its local update w_i from the domain of real numbers to the finite field. We assume that the field size p is large enough to avoid any wrap-around during secure distance computation and secure model aggregation, which will be described in Sections V-C and V-E, respectively.

Quantization is a challenging task as it should be performed in a way that ensures the convergence of the model. Moreover, the quantization function should allow the representation of negative integers in the finite field, and facilitate computations to be performed in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as in [36], which represents the sign of a negative number separately from its magnitude. BREA addresses this challenge with a simple stochastic quantization strategy as follows. For any integer q ≥ 1, we define a stochastic rounding function:

Q_q(x) = { ⌊qx⌋/q          with prob. 1 − (qx − ⌊qx⌋)
         { (⌊qx⌋ + 1)/q    with prob. qx − ⌊qx⌋    (9)

where ⌊x⌋ is the largest integer less than or equal to x, and note that this function is unbiased, i.e., E_Q[Q_q(x)] = x. Parameter q is a tuning parameter that corresponds to the number of quantization levels. The variance of Q_q(x) decreases as the value of q increases, which will be detailed in Lemma 1 in Section VI. We then define the quantized model,

w̄_i := φ(q · Q_q(w_i))    (10)

where the function Q_q from (9) is carried out element-wise, and a mapping function φ : R → F_p is defined to represent a negative integer in the finite field by using two's complement representation,

φ(x) = { x        if x ≥ 0
       { p + x    if x < 0.    (11)
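A minimal Python sketch of the rounding and embedding steps in (9)-(11) is given below; the prime and the number of quantization levels are arbitrary toy values of ours, and the empirical check at the end only illustrates the unbiasedness property E_Q[Q_q(x)] = x.

```python
import math
import random

def Q_q(x, q, rng=random):
    """Stochastic rounding (9): round qx down w.p. 1 - (qx - floor(qx)), up otherwise."""
    lo = math.floor(q * x)
    return (lo + (1 if rng.random() < q * x - lo else 0)) / q

def phi(z, p):
    """Embedding (11): non-negative integers map to themselves, negative z to p + z."""
    return z % p

p, q = 2**61 - 1, 1024                  # toy field size and quantization level
w_i = [0.3141, -2.718]
w_bar = [phi(round(q * Q_q(x, q)), p) for x in w_i]   # quantized model, cf. (10)

# Empirical unbiasedness: the sample mean of Q_q(x) approaches x.
est = sum(Q_q(0.3141, q) for _ in range(10_000)) / 10_000
assert abs(est - 0.3141) < 1e-3
```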

    B. Verifiable Secret Sharing of the User Models

BREA protects the privacy of individual user models through verifiable secret sharing. This is to ensure that the individual user models are kept private while preventing the Byzantine users from breaking the integrity of the protocol by sending invalid secret shares to the other users.

To do so, user i ∈ [N] secret shares its quantized model w̄_i with the other users through a non-interactive verifiable secret sharing protocol [18]. Our framework leverages Feldman's verifiable secret sharing protocol from [18], which combines Shamir's secret sharing [19] with homomorphic encryption. In this setup, each party creates the secret shares, then broadcasts commitments to the coefficients of the polynomial they use for Shamir's secret sharing, so that other parties can verify that the secret shares are constructed correctly. To verify the secret shares from the given commitments, the protocol leverages the homomorphic property of exponentiation, i.e., exp(a + b) = exp(a)exp(b), whereas the privacy protection is based on the assumption that computation of the discrete logarithm in the finite field is intractable.

The individual steps carried out for verifiable secret sharing in our framework are as follows. Initially, the server and users agree on N distinct elements {θ_i}_{i∈[N]} from F_p. This can be done offline by using a conventional majority-based consensus protocol [37], [38]. User i ∈ [N] then generates secret shares of the quantized model w̄_i by forming a random polynomial f_i : F_p → F_p^d of degree T,

f_i(θ) = w̄_i + ∑_{j=1}^{T} r_ij θ^j    (12)

in which the vectors r_ij are generated uniformly at random from F_p^d by user i. User i then sends a secret share of w̄_i to user j, denoted by,

s_ij = f_i(θ_j).    (13)

To make these shares verifiable, user i also broadcasts commitments to the coefficients of f_i, given by

c_ij = { ψ^{w̄_i}    for j = 0
       { ψ^{r_ij}   for j = 1, . . . , T    (14)

where ψ denotes a generator of F_p, and all arithmetic is taken modulo λ for some large prime λ such that p divides λ − 1.

Upon receiving the commitments in (14), each user j ∈ [N] can verify the secret share s_ij = f_i(θ_j) by checking

ψ^{s_ij} = ∏_{k=0}^{T} c_ik^{θ_j^k}    (15)

where all arithmetic is taken modulo λ. This commitment scheme ensures that the secret shares are created correctly from the polynomial in (12), hence they are valid. On the other hand, as we assume the intractability of computing the discrete logarithm [18], the server or the users cannot compute the discrete logarithm log_ψ(c_it) and reveal the quantized model w̄_i from c_i0 in (14).

    C. Secure Distance Computation

Verifiable secret sharing of the model parameters, as described in Section V-B, ensures that the users follow the protocol correctly by creating valid secret shares. However, malicious users can still try to manipulate the trained model by modifying their local updates instead. In this case, the secret shares will be created correctly, but according to a false model. In order to ensure that the trained model is robust against such adversarial manipulations, BREA leverages a distance-based outlier detection mechanism, such as in [10], [39]. The main principle behind these mechanisms is to compute the pairwise distances between the local updates and select a set of models that are sufficiently close to each other. On the other hand, the outlier detection mechanism in BREA has to protect the privacy of local updates, and performing the distance computations on the true values of the model parameters would breach the privacy of individual users.

We address this by a privacy-preserving distance computation approach, in which the pairwise distances are computed locally by each user, using the secret shares of the model parameters received from the other users. In particular, upon


receiving the secret shares of the model parameters as described in Section V-B, user i computes the pairwise distances,

d_jk^(i) := ‖s_ji − s_ki‖²    (16)

between each pair of users j, k ∈ [N], and sends the result to the server. Since the computations in (16) are performed over the secret shares, user i learns no information about the true values of the model parameters w̄_j and w̄_k of users j and k, respectively. Finally, we note that the computation results from (16) are scalar values.

    D. User Selection at the Server

Upon receiving the computation results in (16) from a sufficient number of users, the server reconstructs the true values of the pairwise distances. During this phase, Byzantine users may send incorrect computation results to the server, hence the reconstruction process should be able to correct the potential errors that may occur in the computation results due to malicious users. Our decoding procedure is based on the decoding of Reed-Solomon codes.

The intuition of the decoding process is that the computations from (16) correspond to evaluation points of a univariate polynomial h_jk : F_p → F_p of degree at most 2T, where

h_jk(θ) := ‖f_j(θ) − f_k(θ)‖²    (17)

for θ ∈ {θ_i}_{i∈[N]} and j, k ∈ [N]. Accordingly, h_jk can be viewed as the encoding polynomial of a Reed-Solomon code with degree at most 2T, such that the missing computations due to the dropped users correspond to the erasures in the code, and manipulated computations from Byzantine users refer to the errors in the code. Therefore, the decoding process of the server corresponds to decoding an [N, 2T + 1, N − 2T]_p Reed-Solomon code with at most D erasures and at most A errors. By utilizing well-known Reed-Solomon decoding algorithms [40], the server can recover the polynomial h_jk and obtain the true value of the pairwise distances by using the relation h_jk(0) = ‖f_j(0) − f_k(0)‖² = ‖w̄_j − w̄_k‖². At the end, the server learns the pairwise distances

d̄_jk := ‖w̄_j − w̄_k‖²    (18)

between the models of users j, k ∈ [N]. Then the server converts (18) from the finite field to the real domain as follows,

d_jk = φ⁻¹(d̄_jk) / q²    (19)

for j, k ∈ [N], where q is the integer parameter in (9) and the demapping function φ⁻¹ : F_p → R is defined as

φ⁻¹(x) = { x        if 0 ≤ x < (p − 1)/2
         { x − p    if (p − 1)/2 ≤ x < p.    (20)

We assume the field size p is large enough to ensure the correct recovery of the pairwise distances,

d_jk = φ⁻¹(‖φ(q · Q_q(w_j)) − φ(q · Q_q(w_k))‖²) / q²
     = φ⁻¹(φ(q² ‖Q_q(w_j) − Q_q(w_k)‖²)) / q²    (21)
     = ‖Q_q(w_j) − Q_q(w_k)‖²    (22)

where Q_q is the stochastic rounding function defined in (9) and (21) holds if

q² ‖Q_q(w_j) − Q_q(w_k)‖² < (p − 1)/2.    (23)
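The sketch below illustrates (16)-(18) under the simplifying assumption that all users are honest and present, so plain Lagrange interpolation stands in for Reed-Solomon decoding; all parameter values are toy choices of ours.

```python
import random

p, d, N, T = 2**31 - 1, 2, 7, 2         # toy parameters
rng = random.Random(2)
theta = list(range(1, N + 1))           # evaluation points theta_1, ..., theta_N

def share_poly(w_bar):
    """Degree-T sharing polynomial f_i from (12), evaluated coordinate-wise."""
    coeffs = [list(w_bar)] + [[rng.randrange(p) for _ in range(d)] for _ in range(T)]
    return lambda x: [sum(c[t] * pow(x, k, p) for k, c in enumerate(coeffs)) % p
                      for t in range(d)]

w = [[rng.randrange(100) for _ in range(d)] for _ in range(N)]   # quantized models
f = [share_poly(wi) for wi in w]

j, k = 0, 1
# User i computes d_jk^(i) = ||s_ji - s_ki||^2 on its shares, cf. (16).
evals = [sum((f[j](x)[t] - f[k](x)[t]) ** 2 for t in range(d)) % p for x in theta]

def lagrange_at_zero(xs, ys, p):
    """Value at 0 of the unique degree-(len(xs)-1) polynomial through (xs, ys)."""
    total = 0
    for a, (xa, ya) in enumerate(zip(xs, ys)):
        num = den = 1
        for c, xc in enumerate(xs):
            if c != a:
                num, den = num * (-xc) % p, den * (xa - xc) % p
        total = (total + ya * num * pow(den, -1, p)) % p
    return total

# h_jk has degree at most 2T, so 2T + 1 evaluations suffice, cf. (17)-(18).
d_jk = lagrange_at_zero(theta[:2 * T + 1], evals[:2 * T + 1], p)
assert d_jk == sum((w[j][t] - w[k][t]) ** 2 for t in range(d)) % p
```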

By utilizing the pairwise distances in (22), the server carries out a distance-based outlier removal algorithm to select the set of users to include in the final model aggregation. The outlier removal procedure of BREA follows the multi-Krum algorithm from [10], [39]. The main difference is that our framework considers the multi-Krum algorithm in a quantized stochastic gradient setting, as BREA utilizes quantized gradients instead of the true gradients, in order to enable privacy-preserving computations on the secret shares. We present the theoretical convergence guarantees of this quantized multi-Krum algorithm in Section VI, and numerically demonstrate its convergence behaviour in our experiments in Section VII.

In this setup, the server selects m users through the following iterative process. At each iteration k ∈ [m], the server assigns each remaining candidate a score given by the sum of its distances to its closest users, selects the candidate with the minimum score, and removes it from the candidate set; a sketch is given below.
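The sketch below is our plain-Python reading of this selection rule, following the multi-Krum description in [10], where score(j) sums the distances from j to its N − A − 2 closest candidates; the exact tie-breaking and whether scores are recomputed after each removal are implementation choices, not prescribed by the text.

```python
def multi_krum(dist, N, A, m):
    """Select m users by repeatedly taking the lowest-scoring candidate, where
    score(j) sums the distances from j to its N - A - 2 closest candidates."""
    remaining, selected = set(range(N)), []
    for _ in range(m):
        scores = {}
        for j in remaining:
            closest = sorted(dist[j][k] for k in remaining if k != j)
            scores[j] = sum(closest[: N - A - 2])
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: 5 users, user 4 is an outlier; keep m = 2, assuming A = 1 Byzantine.
D5 = [[0, 1, 2, 1, 9], [1, 0, 1, 2, 9], [2, 1, 0, 1, 9], [1, 2, 1, 0, 9], [9, 9, 9, 9, 0]]
print(multi_krum(D5, N=5, A=1, m=2))     # selects two of the mutually close users
```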


One can then observe that h is the encoding polynomial of a Reed-Solomon code with degree at most T, the missing computations due to the dropped users correspond to the erasures in the code, and manipulated computations from Byzantine users correspond to the errors in the code. Therefore, the decoding process of the server corresponds to decoding an [N, T + 1, N − T]_p Reed-Solomon code with at most D erasures and at most A errors. Hence, by using a Reed-Solomon decoding algorithm, the server can recover the polynomial h and obtain the true value of the aggregate of the selected user models by using the relation h(0) = ∑_{j∈S} f_j(0) = ∑_{j∈S} w̄_j. We note that the total number of users selected by the server for aggregation, i.e., |S| = m.
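As an illustration of this aggregation phase, the following self-contained sketch (scalar models, toy parameters, and honest users, so interpolation replaces Reed-Solomon decoding) shows that the summed shares s_i = ∑_{j∈S} s_ji indeed reconstruct h(0) = ∑_{j∈S} w̄_j from T + 1 responses.

```python
import random

p, N, T = 2**31 - 1, 7, 2
rng = random.Random(5)
theta = list(range(1, N + 1))
w = [rng.randrange(1000) for _ in range(N)]                  # scalar quantized models
coeffs = [[wi] + [rng.randrange(p) for _ in range(T)] for wi in w]

def share(i, x):
    """Share of w[i] at evaluation point x, on a degree-T polynomial."""
    return sum(c * pow(x, k, p) for k, c in enumerate(coeffs[i])) % p

S = [0, 2, 4]                                  # users selected by the server
s = [sum(share(j, x) for j in S) % p for x in theta]   # s_i held by user i

# Lagrange interpolation of h at zero from the first T + 1 responses.
h0 = 0
for a in range(T + 1):
    num = den = 1
    for c in range(T + 1):
        if c != a:
            num, den = num * (-theta[c]) % p, den * (theta[a] - theta[c]) % p
    h0 = (h0 + s[a] * num * pow(den, -1, p)) % p
assert h0 == sum(w[j] for j in S) % p
```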


where σ′(w) = √(1/(4q²) + σ²(w)).

Proof. (Unbiasedness) Given Q_q in (9) and any random variable x, it follows that,

E_Q[Q_q(x) | x] = (⌊qx⌋/q)(1 − (qx − ⌊qx⌋)) + ((⌊qx⌋ + 1)/q)(qx − ⌊qx⌋) = x    (36)

from which we obtain the unbiasedness condition in (34),

E_{Q,ξ}[Q_q(g(w, ξ))] = E_ξ[E_Q[Q_q(g(w, ξ)) | g(w, ξ)]] = E_ξ[g(w, ξ)] = ∇F(w).    (37)

(Bounded variance) Next, we observe that,

E_Q[(Q_q(x) − E_Q[Q_q(x) | x])² | x]
  = (⌊qx⌋/q − x)² (1 − (qx − ⌊qx⌋)) + ((⌊qx⌋ + 1)/q − x)² (qx − ⌊qx⌋)
  = (1/q²)(1/4 − (qx − ⌊qx⌋ − 1/2)²)
  ≤ 1/(4q²)    (38)

from which one can obtain the bounded variance condition in (35) as follows,

E_{Q,ξ} ‖Q_q(g(w, ξ)) − ∇F(w)‖²
  = E_ξ[E_Q[‖Q_q(g(w, ξ)) − ∇F(w)‖² | g(w, ξ)]]
  ≤ E_ξ[E_Q[‖Q_q(g(w, ξ)) − g(w, ξ)‖² | g(w, ξ)]] + E_ξ[E_Q[‖g(w, ξ) − ∇F(w)‖² | g(w, ξ)]]    (39)
  ≤ d/(4q²) + dσ²(w)    (40)
  = dσ′²(w)

where (39) follows from the triangle inequality and (40) follows from (38). ∎
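A quick numerical sanity check of the unbiasedness in (36) and the variance bound in (38) can be run as follows (the sample size, test point, and tolerances are arbitrary choices of ours):

```python
import math
import random

q, x, n = 8, 0.337, 200_000
samples = []
for _ in range(n):
    lo = math.floor(q * x)
    samples.append((lo + (random.random() < q * x - lo)) / q)
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
assert abs(mean - x) < 1e-3                  # unbiasedness, cf. (36)
assert var <= 1 / (4 * q * q) + 1e-4         # variance bound, cf. (38)
print(f"mean={mean:.4f}, var={var:.6f}, bound={1 / (4 * q * q):.6f}")
```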

As discussed in Section IV, Byzantine users can manipulate the training protocol via two means, either by modifying their local update (directly or by modifying the local dataset), or by sharing false information during protocol execution. In this section, we demonstrate how BREA provides robustness in both cases. We first focus on the former case and study the resilience of the global model, i.e., conditions under which the trained model remains close to the true model, even if some users modify their local updates adversarially. The second case, i.e., robustness of the protocol when some users exchange false information during the protocol execution, will be considered in Theorem 1.

In order to evaluate the resilience of the global model against Byzantine adversaries, we adopt the notion of (α, A)-Byzantine resilience from [39].

Definition 1 ((α, A)-Byzantine resilience, [39]). Let 0 ≤ α < π/2 be any angular value and 0 ≤ A ≤ N be any integer. Let w_1, . . . , w_N ∈ R^d be any i.i.d. random vectors such that w_i ∼ w with E[w] = g. Let b_1, . . . , b_A ∈ R^d be any random vectors. Then, function f is (α, A)-Byzantine resilient if, for any 1 ≤ j_1 < · · · < j_A ≤ N, the vector

f := f(w_1, . . . , b_1, . . . , b_A, . . . , w_N)    (41)

(with b_1, . . . , b_A occupying positions j_1, . . . , j_A) satisfies: i) g⊤ E[f] ≥ (1 − sin α) ‖g‖², and ii) for r ∈ {2, 3, 4}, E‖f‖^r is bounded above by E‖f‖^r ≤ C ∑_{r_1+···+r_{N−A}=r} E‖w‖^{r_1} · · · E‖w‖^{r_{N−A}}, where C denotes a generic constant.

Lemma 2 below states that if the standard deviation caused by random sample selection and quantization is smaller than the norm of the true gradient, and 2A + 2 < N − m, then the quantized multi-Krum function is (α, A)-Byzantine resilient.


• (Robustness against Byzantine users) The protocol executes correctly against up to A Byzantine users and the trained model is (α, A)-Byzantine resilient.

• (Convergence) The sequence of the gradients ∇F(w^(t)) converges almost surely to zero,

∇F(w^(t)) → 0 almost surely as t → ∞.    (45)

• (Privacy) The server or any group of up to T users cannot compute an unknown local update. For any set of users 𝒯 ⊂ [N] of size at most T,

P[User i has secret w̄_i | view_𝒯] = P[User i has secret w̄_i]    (46)

for all i ∈ [N] \ 𝒯, where view_𝒯 denotes the messages that the members of 𝒯 receive,

for any N ≥ 2A + 1 + max{m + 2, D + 2T}, where m is the number of selected models for aggregation.

Remark 1. The two conditions ∑_{t=1}^{∞} γ^(t) = ∞ and ∑_{t=1}^{∞} (γ^(t))² < ∞ are instrumental in the convergence of stochastic gradient descent algorithms [41]. Condition ∑_{t=1}^{∞} (γ^(t))² < ∞ states that the learning rates decrease fast enough, whereas condition ∑_{t=1}^{∞} γ^(t) = ∞ bounds the rate of their decrease, to ensure that the learning rates do not decrease too fast.

Remark 2. We consider a general (possibly non-convex) objective function F. In such scenarios, proving the convergence of the model directly is challenging, and various approaches have been proposed instead. Our approach follows [41] and [10], where we prove the convergence of the gradient to a flat region instead. We note, however, that such a region may refer to any stationary point, including the local minima as well as saddle and extremal points.

Proof. (Robustness against Byzantine users) The (α, A)-Byzantine resilience of the trained model follows from Lemma 2. We next provide sufficient conditions for BREA to correctly evaluate the update function (32), in the presence of A Byzantine users. Byzantine users may send any arbitrary random vector to the server or other users in every step of the protocol in Section V. In particular, Byzantine users can create and send incorrect computations in three attack scenarios: i) sending invalid secret shares s_ij in (13), ii) sending incorrect secure distance computations d_jk^(i) in (16), and iii) sending incorrect aggregates of the secret shares s_i in (26).

The first attack scenario occurs when the secret shares s_ij in (13) do not refer to the same polynomial from (12). BREA utilizes verifiable secret sharing to prevent such attempts. The correctness (validity) of the secret shares can be verified by testing (15), whenever the majority of the surviving users are honest, i.e., N > 2A + D [18], [37].

The second attack scenario can be detected and corrected by the Reed-Solomon decoding algorithm. In particular, as described in Section V-D, given j, k ∈ [N], {d_jk^(i)}_{i∈[N]} can be viewed as N evaluation points of the polynomial h_jk given in (17), whose degree is at most 2T. The decoding process at the server then corresponds to the decoding of an [N, 2T + 1, N − 2T]_p Reed-Solomon code with at most D erasures and at most A errors. As an [n, k, n − k + 1]_p Reed-Solomon code with e erasures can tolerate a maximum of ⌊(n − k − e)/2⌋ errors [40], the server can recover the correct pairwise distances as long as A ≤ ⌊(N − (2T + 1) − D)/2⌋, i.e., N ≥ D + 2A + 2T + 1.

The third attack scenario can also be detected and corrected by the Reed-Solomon decoding algorithm. As described in Section V-E, {s_i}_{i∈[N]} are evaluation points of polynomial h in (27) of degree at most T. This decoding process corresponds to the decoding of an [N, T + 1, N − T]_p Reed-Solomon code with at most D erasures and at most A errors. As such, the server can recover the desired aggregate model h(0) = ∑_{j∈S} w̄_j as long as N ≥ D + 2A + T + 1. Therefore, combining with the condition of Lemma 2, the sufficient condition under which BREA guarantees robustness against Byzantine users is given by

N ≥ 2A + 1 + max{m + 2, D + 2T}.    (47)

(Convergence) We now consider the update equation in (32) and prove the convergence of the random sequence ∇F(w^(t)). From Lemma 2, the quantized multi-Krum function f in (32) is (α, A)-Byzantine resilient. Hence, from Proposition 2 of [39], ∇F(w^(t)) converges almost surely to zero,

∇F(w^(t)) → 0 almost surely as t → ∞.    (48)

(Privacy) As described in Section V-B, we assume the intractability of computing discrete logarithms, hence the server or any user cannot compute w̄_i from c_i0 in (14). It is therefore sufficient to prove the privacy of each individual model against a group of T colluding users, in the case where 𝒯 has size T. If T users cannot get any information about w̄_i, then neither can fewer than T users. Without loss of generality, let 𝒯 = {1, . . . , T} = [T] and view_𝒯 = {(s_kj)_{k∈[N]}, (d_kl^(j))_{k,l∈[N]}, s_j}_{j∈[T]}, where s_kj in (13) is the secret share of w̄_k sent from user k to user j, d_kl^(j) in (16) is the pairwise distance of the secret shares sent from users k and l to user j, and s_j in (26) is the aggregate of the secret shares. As {(d_kl^(j))_{k,l∈[N]}}_{j∈[T]} and {s_j}_{j∈[T]} are determined by {(s_kj)_{k∈[N]}}_{j∈[T]}, we can simplify the left hand side (LHS) of (46) as

P[User i has secret w̄_i | view_𝒯]
  = P[User i has secret w̄_i | {s_kj}_{k∈[N], j∈[T]}]    (49)
  = P[User i has secret w̄_i | {s_ij}_{j∈[T]}]    (50)

where (50) follows from the fact that s_kj is independent of w̄_i for any k ≠ i. Then, for any realization of vectors d_0, . . . , d_T ∈ F_p^d, we obtain,

P[User i has secret w̄_i | view_𝒯]
  = P[w̄_i = d_0 | s_i1 = d_1, . . . , s_iT = d_T]
  = P[w̄_i = d_0, s_i1 = d_1, . . . , s_iT = d_T] / P[s_i1 = d_1, . . . , s_iT = d_T]
  = (1/|F_p^d|^{T+1}) / (1/|F_p^d|^T)    (51)

  = 1/|F_p^d|
  = P[w̄_i = d_0]

where (51) follows from the fact that any T + 1 evaluation points define a unique polynomial of degree T, which completes the proof of privacy. ∎

TABLE I
COMPLEXITY SUMMARY OF BREA.

          Computation                          Communication
Server    O((N³ + dN) log² N log log N)        O(dN + N³)
User      O(dN log² N + dN²)                   O(dN + N²)

    A. Complexity Analysis

In this section, we analyze the complexity of BREA with respect to the number of users, N, and the model dimension d.

Complexity Analysis of the Users: User i's computation cost can be broken into three parts: 1) generating the secret shares s_ij in (12) for j ∈ [N], 2) computing the pairwise distances d_jk^(i) in (16) for j, k ∈ [N], and 3) aggregating the secret shares belonging to the selected users from (26). First, generating N secret shares of a vector with dimension d has a computation cost of O(dN log² N) [19]. Second, as there are O(N²) pairwise distances, computing the pairwise distances has a computation cost of O(dN²) in total. Third, when the number of selected users m = |S| is O(N), aggregating the secret shares belonging to the selected users has a computation cost of O(dN). Therefore, the overall computation cost of each user is O(dN log² N + dN²).

User i's communication cost can be broken into three parts: 1) sending the secret share s_ij to user j ∈ [N], 2) sending the secret shares of the pairwise distances d_jk^(i) to the server for j, k ∈ [N], and 3) sending the aggregate of the secret shares s_i in (26) to the server. The communication costs of the three parts are O(dN), O(N²), and O(d), respectively. Therefore, the overall communication cost of each user is O(dN + N²).

Complexity Analysis of the Server: The computation cost of the server can be broken into three parts: 1) decoding the pairwise distances by recovering h_jk(0) in (17) for j, k ∈ [N], 2) carrying out the multi-Krum algorithm to select the m users for aggregation, and 3) decoding the aggregate of the selected models by recovering h(0) in (27). As described in Section V-D, recovering the polynomial h_jk corresponds to decoding an [N, 2T + 1, N − 2T]_p Reed-Solomon code with at most D erasures and at most A errors, which has an O(N log² N log log N) computation cost [40]. As there are O(N²) pairs and each pair is embedded in a single polynomial, the computation cost of the first part is O(N³ log² N log log N) in total. The computation cost of the second part is O(dN² + N² log N) [10]. As the dimension of h(0) is d, the computation cost of the third part is O(dN log² N log log N). Overall, the computation cost of the server is O((N³ + dN) log² N log log N).

The communication cost of the server can be broken into two parts: 1) receiving the secret shares of the pairwise distances d_jk^(i) from users i ∈ [N] for j, k ∈ [N], and 2) receiving the aggregates of the secret shared models s_i from users i ∈ [N]. The communication costs of the two parts are O(N³) and O(dN), respectively. Overall, the communication cost of the server is O(dN + N³).

We summarize the complexity analysis in Table I. As can be observed in Table I, the server has a communication cost of O(N³) and a computation cost of O(N³ log² N log log N), which is due to the recovery of N² pairwise distances. Although the distances are scalar valued, the overhead can become a limitation for very large-scale networks. In the next subsection, we propose a generalized framework to reduce the communication cost from O(N³) to O(N²), as well as to reduce the computation cost from O(N³ log² N log log N) to O(N² log² N log log N).

TABLE II
COMPLEXITY SUMMARY OF GENERALIZED BREA.

          Computation                          Communication
Server    O((N² + dN) log² N log log N)        O(dN + N²)
User      O(dN log² N + dN²)                   O(dN + N)

    B. The Generalized BREA Framework

The key idea of the generalized framework is to partition the set of N(N − 1)/2 pairwise distances into sets of size K, and embed the K distances of each set in a single polynomial. Consequently, the number of polynomials needed to embed the N(N − 1)/2 pairwise distances can be reduced from O(N²) to O(N²/K). Each user then sends a single evaluation point of each polynomial to the server, which has an O(N³/K) communication cost in total. By setting K = O(N), the generalized BREA framework can achieve O(N²) communication complexity.

We now present the details of the generalized BREA framework. In the stochastic quantization phase, the generalized BREA framework follows the same steps as in Section V-A. In the verifiable secret sharing phase, user i ∈ [N] generates secret shares of the quantized model w̄_i by modifying the random polynomial f_i in (12) as,

f_i(θ) = w̄_i + ∑_{j=1}^{T} r_ij θ^{K+j−1}    (52)

where the degree of f_i is increased from T to K + T − 1. User i then sends a secret share of w̄_i to user j ∈ [N], denoted by s_ij = f_i(θ_j).

In the secure distance computation phase, user i computes the pairwise distances d_jk^(i) as in (16) for j, k ∈ [N]. Note that there are N(N − 1)/2 unique distance values, due to the symmetry of the distance measure. The server and the users then agree on a partition P_1, . . . , P_G of the N(N − 1)/2 distances into sets of size K, where G = ⌈N(N − 1)/(2K)⌉ and |P_k| = K for all k ∈ [G] (⌈x⌉ denotes the smallest integer greater than or equal to x). Then, each user embeds the pairwise distances in each P_k into a single polynomial, for every k ∈ {1, . . . , G}. For instance, let the first set in the partition be P_1 = {(1, 2), . . . , (1, K + 1)}. Then, one can define a univariate polynomial u_{P_1} : F_p → F_p


    of degree 2( +)−1)+ −1 to embed the pairwise distancesof the pairs in P1 as

    DP1 (\) = ‖f1(\) − f2(\)‖2 + \‖f1 (\) − f3 (\)‖2

    + · · · + \ −1‖f1 (\) − f +1(\)‖2. (53)

    As the coefficients from the first degree to the ( − 1)-thdegree terms in (52) are zero, the coefficient of the (: − 1)-thdegree term in (53) corresponds to the pairwise distance ofthe :-th pair in P1, i.e., ‖f1 (0) − f:+1(0)‖2 = ‖w1 − w:+1‖2for all : ∈ [ ]. User 8 then sends DP1 (\8) to the server. Uponreceiving the computation results from a sufficient number ofusers, the server can decode the pairwise distances of all pairsin P1 by reconstructing the polynomial DP1 (\8). As the degreeof DP1 is 3 + 2) − 3, the minimum number of results theserver needs to collect from the users to recover the pairwisedistances, i.e., the recovery threshold, is 3 + 2) − 2. Thedecoding process of the polynomial corresponds to decodingan [#, 3 + 2) − 3, # − 3 − 2) + 2] ? Reed-Solomon codewith at most � erasures and at most � errors. In a similarway, we can define polynomials DP: for : ∈ {2, . . . , �} whereuser 8 computes and sends DP: (\8) to the server. The servercan then decode all of the # (# − 1)/2 pairwise distances byreconstructing the � polynomials.

    After the server learns the pairwise distances, the gener-alized protocol follows the same steps as in Sections V-Dand V-E. The overall algorithm of the generalized BREAframework is presented in Algorithm 2.

The generalized BREA framework achieves a communication cost of $O(N^2)$ and a computation cost of $O(N^2 \log^2 N \log\log N)$ by setting $\ell = O(N)$, which follows from two observations. First, user $i$ sends an evaluation point of each polynomial $D_{\mathcal{P}_k}$ to the server for $k \in [K]$, and there are $N$ users and $K = O(N^2/\ell) = O(N)$ polynomials, which yields $O(N^2)$ communication overhead in total. Second, the computation cost of decoding each polynomial $D_{\mathcal{P}_k}$ is $O(N \log^2 N \log\log N)$ [40], and there are $K = O(N)$ polynomials, giving a total computation cost of $O(N^2 \log^2 N \log\log N)$. We summarize the complexity of the generalized BREA protocol in Table II.

    VII. EXPERIMENTS

In this section, we demonstrate the convergence and resilience properties of BREA compared to conventional federated learning, i.e., the federated averaging scheme from [2], termed FedAvg throughout this section. We measure performance in terms of the cross-entropy loss evaluated over the training samples and the model accuracy evaluated over the test samples, as a function of the iteration index $t$.

Network architecture: We consider an image classification task with 10 classes on the MNIST dataset [42] and train a convolutional neural network with 6 layers [2], including two $5 \times 5$ convolutional layers with stride 1, where the first and second layers have 32 and 64 channels, respectively, and each is followed by ReLU activation and a $2 \times 2$ max-pooling layer. The network also includes a fully connected layer with 1024 units and ReLU activation, followed by a final softmax output layer.
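A sketch of this architecture in PyTorch; padding and other unstated details below are our assumptions:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),  # 28x28 -> 28x28
        nn.ReLU(),
        nn.MaxPool2d(2),                                       # -> 14x14
        nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(2),                                       # -> 7x7
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 1024),
        nn.ReLU(),
        nn.Linear(1024, 10),   # softmax applied inside the cross-entropy loss
    )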

Algorithm 2 Generalized BREA
input: Local dataset $\mathcal{D}_i$ of users $i \in [N]$, number of iterations $J$.
output: Global model $\mathbf{w}^{(J)}$.
 1: for iteration $t = 0, \ldots, J-1$ do
 2:   for user $i = 1, \ldots, N$ do
 3:     Download the global model $\mathbf{w}^{(t)}$ from the server.
 4:     Create a local update $\mathbf{w}_i^{(t)}$ from (3).
 5:     Create the quantized model $\bar{\mathbf{w}}_i = \phi(q \cdot Q_q(\mathbf{w}_i))$ from (10).
 6:     Generate secret shares $\{\mathbf{s}_{ij} = f_i(\theta_j)\}_{j \in [N]}$ from (52) and send $\mathbf{s}_{ij}$ to user $j \in [N]$.
 7:     Generate commitments $\{\mathbf{c}_{ij}\}_{j \in [T]}$ from (14) and broadcast $\{\mathbf{c}_{ij}\}_{j \in [T]}$ to all users.
 8:     Verify the secret shares $\{\mathbf{s}_{ji}\}_{j \in [N]}$ by testing (15).
 9:     Compute $\{d^{(i)}_{jk}\}_{j,k \in [N]}$ from (16) and send the results to the server.
10:   Server recovers $\{d_{jk}\}_{j,k \in [N]}$ in (18) from the coefficients of the polynomials $D_{\mathcal{P}_1}, \ldots, D_{\mathcal{P}_K}$ in (53) by utilizing Reed-Solomon decoding.
11:   Server converts $\{d_{jk}\}_{j,k \in [N]}$ from the finite field to the real domain to obtain the pairwise distances from (19).
12:   Server selects the set $\mathcal{S}$ by utilizing the multi-Krum algorithm [10] based on the pairwise distances $\{d_{jk}\}_{j,k \in [N]}$.
13:   Server broadcasts the set $\mathcal{S}$ to all users.
14:   for user $i = 1, \ldots, N$ do
15:     Compute $\mathbf{s}_i = \sum_{j \in \mathcal{S}} \mathbf{s}_{ji}$ from (26) and send the result to the server.
16:   Server recovers $\sum_{i \in \mathcal{S}} \bar{\mathbf{w}}_i^{(t)}$ from the computation results $\{\mathbf{s}_i\}_{i \in [N]}$ by utilizing Reed-Solomon decoding.
17:   Server updates the global model: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \frac{\gamma^{(t)}}{q} \phi^{-1}\big(\sum_{i \in \mathcal{S}} \bar{\mathbf{w}}_i^{(t)}\big)$.
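Step 12 runs in plaintext at the server. The sketch below shows one common form of multi-Krum selection from pairwise squared distances, under our reading of [10]; details such as tie-breaking are our choices.

    import numpy as np

    def multi_krum(dist_sq: np.ndarray, A: int, m: int) -> np.ndarray:
        """Select m users from an N x N matrix of pairwise squared
        distances: score each user by the sum of its N - A - 2 smallest
        squared distances to the others, then keep the m lowest-scoring
        users. Outlying (Byzantine) updates receive large scores."""
        N = dist_sq.shape[0]
        masked = dist_sq + np.diag(np.full(N, np.inf))   # ignore self-distances
        scores = np.sort(masked, axis=1)[:, : N - A - 2].sum(axis=1)
        return np.argsort(scores)[:m]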

Experiment setup: We assume a network of $N = 40$ users, where up to $T = 7$ users may collude and $A$ users are malicious. We consider two cases for the number of Byzantine users: i) 0% Byzantine users ($A = 0$) and ii) 30% Byzantine users ($A = 12$). For the i.i.d. data distribution, the 60000 training samples are shuffled and then partitioned across the $N = 40$ users, each receiving 1500 samples. Honest users utilize the ADAM optimizer [43] to compute the local update, setting the size of the local mini-batch to $|\xi_i^{(t)}| = 500$ for all $i \in [N]$, $t \in [J]$, where $J$ is the total number of iterations. Byzantine users generate vectors uniformly at random from $\mathbb{F}_p^d$, where we set the field size to $p = 2^{32} - 5$, the largest prime within 32 bits. For both schemes, BREA and FedAvg, the number of models to be aggregated is set to $m = |\mathcal{S}| = 13 < N - 2A - 2$. FedAvg randomly selects $m$ models at each iteration, while BREA selects $m$ users from (24).
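For concreteness, the Byzantine behavior used in the experiments amounts to the following sketch (the dimension $d$ below is illustrative; the actual models are much larger):

    import numpy as np

    p = 2**32 - 5        # field size used in the experiments
    d = 10               # model dimension (illustrative)

    def byzantine_update(rng: np.random.Generator) -> np.ndarray:
        # A Byzantine user ignores its local data and submits a vector
        # drawn uniformly at random from F_p^d.
        return rng.integers(0, p, size=d, dtype=np.uint64)

    print(byzantine_update(np.random.default_rng(0)))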

Convergence and robustness against Byzantine users: Figure 2 shows the test accuracy of BREA and FedAvg for different numbers of Byzantine users. We observe that BREA with 0% and 30% Byzantine users is as efficient as FedAvg with 0% Byzantine users, while FedAvg does not tolerate Byzantine users. Figure 3 presents the cross-entropy loss of BREA versus FedAvg for different numbers of Byzantine users. We omit FedAvg with 30% Byzantine users as it diverges. We observe that BREA with 30% Byzantine users converges at a rate comparable to FedAvg with 0% Byzantine users, while providing robustness against Byzantine users and preserving privacy. For all cases of BREA in Figures 2 and 3, we set the quantization parameter in (9) to $q = 1024$.

Fig. 2. Test accuracy of BREA and FedAvg [2] for different numbers of Byzantine users with the i.i.d. MNIST dataset.

Fig. 3. Convergence of BREA and FedAvg [2] for different numbers of Byzantine users with the i.i.d. MNIST dataset.

Figure 4 further illustrates the cross-entropy loss of BREA for different values of the quantization parameter $q$. We observe that BREA with a larger value of $q$ performs better, because the variance introduced by the quantization function $Q_q$ defined in (9) decreases as $q$ increases. On the other hand, given the field size $p$, the quantization parameter $q$ must stay below a certain threshold to ensure that (23) holds.

Fig. 4. Convergence of BREA for different values of the quantization parameter $q$ in (9) with 30% Byzantine users and the i.i.d. MNIST dataset.
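The exact quantizer is defined in (9); the variance behavior described above matches standard unbiased stochastic rounding, sketched below in our own formulation (not a transcription of (9)):

    import numpy as np

    def stochastic_round(x: np.ndarray, q: int, rng: np.random.Generator) -> np.ndarray:
        """Round x to the grid {k/q : k integer}, rounding up with
        probability equal to the fractional part of q*x. The rounding is
        unbiased, E[Q_q(x)] = x, and its variance shrinks as q grows,
        consistent with the trend in Figure 4."""
        scaled = x * q
        low = np.floor(scaled)
        frac = scaled - low
        return (low + (rng.random(x.shape) < frac)) / q

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    print(stochastic_round(x, 1024, rng))   # close to x for q = 1024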

Non-i.i.d. dataset: To investigate the performance of BREA with a non-i.i.d. dataset, we partition the MNIST data over $N = 100$ users using the same non-i.i.d. setting as [2]. In this setting, the 60000 training samples are sorted by their digit labels and partitioned into 200 shards of size 300. Then, $N = 100$ users are considered, where each user receives two shards with different digit labels. We assume $T = 20$ users may collude and consider two cases for the number of Byzantine users: i) 0% Byzantine users ($A = 0$) and ii) 20% Byzantine users ($A = 20$). Honest users utilize the ADAM optimizer to update the model, setting the size of the local mini-batch to $|\xi_i^{(t)}| = 10$ and the number of local iterations to 5 (i.e., each user locally updates the model 5 times before sending it to the server at each iteration). Byzantine users generate vectors uniformly at random from $\mathbb{F}_p^d$. For both schemes, BREA and FedAvg, the number of models to be aggregated is set to $m = 50$. Figure 5 shows the test accuracy of BREA and FedAvg for different numbers of Byzantine users. Even for the non-i.i.d. dataset, BREA with 0% and 20% Byzantine users empirically demonstrates test accuracy comparable to FedAvg with 0% Byzantine users.

Fig. 5. Test accuracy of BREA and FedAvg [2] for different numbers of Byzantine users with the non-i.i.d. MNIST dataset.
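The shard construction described above can be sketched as follows (a minimal sketch; the random dealing shown here does not by itself guarantee that a user's two shards carry different digit labels, and `mnist_train_labels` is a hypothetical label array):

    import numpy as np

    def noniid_partition(labels: np.ndarray, num_users: int = 100,
                         num_shards: int = 200, seed: int = 0) -> dict:
        """Sort the sample indices by label, split them into num_shards
        contiguous shards, and deal two shards to each user, mirroring
        the non-i.i.d. setting of [2]."""
        rng = np.random.default_rng(seed)
        order = np.argsort(labels, kind="stable")    # 60000 indices sorted by digit
        shards = np.array_split(order, num_shards)   # 200 shards of size 300
        deal = rng.permutation(num_shards).reshape(num_users, 2)
        return {u: np.concatenate([shards[s] for s in deal[u]])
                for u in range(num_users)}

    # Usage: user_indices = noniid_partition(mnist_train_labels)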

CIFAR-10 dataset: To investigate the performance of BREA on a larger dataset, we additionally experiment with CIFAR-10 [44], using the same model architecture as [4] (about $10^6$ model parameters). We use the same setting for the parameters $N$, $A$, ...

Fig. 6. Test accuracy of BREA and FedAvg [2] for different numbers of Byzantine users with the CIFAR-10 dataset.

    VIII. CONCLUSION

This paper presents the first single-server solution for Byzantine-resilient secure federated learning. Our framework is based on a verifiable secure outlier detection strategy to guarantee robustness of the trained model against Byzantine faults while protecting the privacy of individual users. We provide theoretical convergence guarantees and characterize the fundamental performance trade-offs of our framework in terms of the number of Byzantine adversaries and user dropouts the system can tolerate. In our experiments, we implemented our system in a distributed network using two ways of partitioning the MNIST dataset over up to $N = 100$ users: i.i.d. and non-i.i.d. data distributions. For both settings, we numerically demonstrated convergence while providing robustness against Byzantine users and preserving privacy. Future directions include developing single-server Byzantine-resilient secure learning architectures for environments that are heterogeneous in computation and communication resources, as well as more efficient communication architectures and quantization techniques.

In this paper, we utilize a distance-based outlier detection approach to guarantee robustness against Byzantine users. There have been many outlier detection approaches for Byzantine-resilient SGD methods (or federated learning). Existing works can be roughly classified into two main categories: 1) geometric median [10], [11] and 2) coordinate-wise median [12]-[14]. Our framework utilizes the geometric-median-based approach because it is more efficient in the secure domain. In order to preserve the privacy of the local updates, coordinate-wise median approaches require a multiparty secure comparison protocol [45] to find the median value of each coordinate. In our framework, on the other hand, the server compares the pairwise distances in plaintext, which is much more efficient than a secure comparison protocol. Extending coordinate-wise median algorithms to the secure domain would be a very interesting future direction.

ACKNOWLEDGEMENT

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053, ARO award W911NF1810400, NSF grants CCF-1703575 and CCF-1763673, ONR Award No. N00014-16-1-2189, and research gifts from Intel and Facebook. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

    REFERENCES

[1] J. So, B. Güler, and A. S. Avestimehr, "Byzantine-resilient secure federated learning," IEEE Journal on Selected Areas in Communications, 2020.

[2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 54, Fort Lauderdale, FL, USA, Apr. 2017, pp. 1273-1282.

[3] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for privacy-preserving machine learning," in ACM SIGSAC Conf. on Comp. and Comm. Security. ACM, 2017, pp. 1175-1191.

[4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Int. Conf. on Artificial Int. and Stat. (AISTATS), 2017, pp. 1273-1282.

[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," in 2nd SysML Conf., 2019.

[6] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.

[7] J. So, B. Güler, and A. S. Avestimehr, "Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning," arXiv preprint arXiv:2002.04156, 2020.

[8] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50-60, 2020.

[9] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays, "Applied federated learning: Improving Google keyboard query suggestions," arXiv preprint arXiv:1812.02903, 2018.

[10] P. Blanchard, E. M. E. Mhamdi, R. Guerraoui, and J. Stainer, "Machine learning with adversaries: Byzantine tolerant gradient descent," in Advances in Neural Information Processing Systems, 2017, pp. 119-129.

[11] Y. Chen, L. Su, and J. Xu, "Distributed statistical machine learning in adversarial settings: Byzantine gradient descent," Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 1, no. 2, pp. 1-25, 2017.

[12] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," in Proceedings of the 35th International Conference on Machine Learning, vol. 80, Stockholm, Sweden, Jul. 2018, pp. 5650-5659.

[13] D. Alistarh, Z. Allen-Zhu, and J. Li, "Byzantine stochastic gradient descent," in Advances in Neural Information Processing Systems, 2018, pp. 4613-4623.

[14] H. Yang, X. Zhang, M. Fang, and J. Liu, "Byzantine-resilient stochastic gradient descent for distributed learning: A Lipschitz-inspired coordinate-wise median approach," in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019, pp. 5832-5837.

[15] L. Zhu, Z. Liu, and S. Han, "Deep leakage from gradients," in Advances in Neural Information Processing Systems, 2019, pp. 14774-14784.

[16] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, "Beyond inferring class representatives: User-level privacy leakage from federated learning," in IEEE Conference on Computer Communications (INFOCOM 2019), 2019, pp. 2512-2520.

[17] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, "Inverting gradients - How easy is it to break privacy in federated learning?" arXiv preprint arXiv:2003.14053, 2020.

[18] P. Feldman, "A practical scheme for non-interactive verifiable secret sharing," in 28th Annual Symposium on Foundations of Computer Science. IEEE, 1987, pp. 427-438.

[19] A. Shamir, "How to share a secret," Communications of the ACM, vol. 22, no. 11, pp. 612-613, 1979.

[20] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, "RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1544-1551.

[21] D. Data and S. Diggavi, "Byzantine-resilient SGD in high dimensions on heterogeneous data," arXiv preprint arXiv:2005.07866, 2020.

[22] G. Ács and C. Castelluccia, "I have a dream! (differentially private smart metering)," in International Workshop on Information Hiding. Springer, 2011, pp. 118-132.

[23] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. on Inf. Theory, vol. 22, no. 6, pp. 644-654, 1976.

[24] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in Conference on Neural Information Processing Systems: Workshop on Private Multi-Party Machine Learning, 2016.

[25] K. Bonawitz, F. Salehi, J. Konečný, B. McMahan, and M. Gruteser, "Federated learning with autotuned communication-efficient secure aggregation," arXiv preprint arXiv:1912.00131, 2019.

[26] R. C. Geyer, T. Klein, and M. Nabi, "Differentially private federated learning: A client level perspective," arXiv preprint arXiv:1712.07557, 2017.

[27] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan, "Can you really backdoor federated learning?" arXiv preprint arXiv:1911.07963, 2019.

[28] C. Dwork, "Differential privacy: A survey of results," in International Conference on Theory and Applications of Models of Computation. Springer, 2008, pp. 1-19.

[29] M. Mohri, G. Sivek, and A. T. Suresh, "Agnostic federated learning," arXiv preprint arXiv:1902.00146, 2019.

[30] T. Li, M. Sanjabi, and V. Smith, "Fair resource allocation in federated learning," arXiv preprint arXiv:1905.10497, 2019.

[31] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," arXiv preprint arXiv:1907.02189, 2019.

[32] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, "Analyzing federated learning through an adversarial lens," arXiv preprint arXiv:1811.12470, 2018.

[33] M. Fang, X. Cao, J. Jia, and N. Z. Gong, "Local model poisoning attacks to Byzantine-robust federated learning," arXiv preprint arXiv:1911.11815, 2019.

[34] L. He, S. P. Karimireddy, and M. Jaggi, "Secure Byzantine-robust machine learning," arXiv preprint arXiv:2006.04747, 2020.

[35] J. Bell, K. Bonawitz, A. Gascón, T. Lepoint, and M. Raykova, "Secure single-server aggregation with (poly)logarithmic overhead," IACR Cryptol. ePrint Arch., 2020. [Online]. Available: https://eprint.iacr.org/2020/704

[36] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709-1720.

[37] D. Dolev and H. R. Strong, "Authenticated algorithms for Byzantine agreement," SIAM Journal on Computing, vol. 12, no. 4, pp. 656-666, 1983.

[38] G. F. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design. Pearson Education, 2005.

[39] P. Blanchard, E. M. E. Mhamdi, R. Guerraoui, and J. Stainer, "Byzantine-tolerant machine learning," arXiv preprint arXiv:1703.02757, 2017.

[40] S. Gao, "A new algorithm for decoding Reed-Solomon codes," in Communications, Information and Network Security. Springer, 2003, pp. 55-68.

[41] L. Bottou, "Online learning and stochastic approximations," Online Learning in Neural Networks, vol. 17, no. 9, p. 142, 1998.

[42] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist

[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

[44] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.

[45] F. Kerschbaum, D. Biswas, and S. de Hoogh, "Performance comparison of secure comparison protocols," in 2009 20th International Workshop on Database and Expert Systems Application. IEEE, 2009, pp. 133-136.

    APPENDIX

    A. The BREA Framework with Local Updating Schemes

In this section, we present how the BREA framework can work with local updating schemes, where each user locally takes multiple steps of stochastic gradient descent using its local data before sending the updated model to the server.

The BREA framework with local updating schemes follows the same steps from Section V-A to Section V-E, changing the local update $\mathbf{w}_i^{(t)}$ of user $i$ from a gradient estimate to the model obtained through multiple steps of stochastic gradient descent (SGD). The update equations of a benign user $i$ can then be expressed as

    \mathbf{v}_i^{(t+1)} = \mathbf{w}_i^{(t)} - \gamma_t\, g(\mathbf{w}_i^{(t)}, \xi_i^{(t)})    (54)

    \mathbf{w}_i^{(t+1)} =
    \begin{cases}
      \frac{1}{m} \sum_{i \in \mathcal{S}^{(t+1)}} \mathbf{v}_i^{(t+1)} & \text{if } t+1 \in \mathcal{J}_E \\
      \mathbf{v}_i^{(t+1)} & \text{if } t+1 \notin \mathcal{J}_E
    \end{cases}    (55)

where $\mathcal{J}_E$ denotes the set of global synchronization steps, i.e., $\mathcal{J}_E = \{nE \mid n = 1, 2, \cdots\}$, $E$ is the number of local iterations, and $\mathcal{S}^{(t+1)}$ is the index set of users selected at iteration $t+1$ by the multi-Krum algorithm described in Section V-D. The additional vector $\mathbf{v}_i^{(t+1)}$ represents the intermediate result of one SGD step from $\mathbf{w}_i^{(t)}$. A malicious user $i$ can send any random vector $\mathbf{b}_i^{(t+1)} \in \mathbb{R}^d$, which we represent as $\mathbf{v}_i^{(t+1)} = \mathbf{b}_i^{(t+1)}$. We note that the distance-based outlier detection algorithm, whose inputs are the local models instead of the local gradients, remains effective, as the distance between two local models equals the distance between the corresponding local gradients multiplied by the learning rate $\gamma$. Consequently, the aggregation step at the server side can be represented as

    \mathbf{w}^{(t)} = \sum_{i \in \mathcal{S}^{(t)}} \mathbf{w}_i^{(t)} \quad \text{if } t \in \mathcal{J}_E.    (56)
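A minimal sketch of this local-update schedule (the gradient oracle grad_fn is a hypothetical stand-in for $g(\cdot, \xi)$):

    import numpy as np

    def local_rounds(w_global: np.ndarray, grad_fn, gamma: float, E: int) -> np.ndarray:
        """Run E local SGD steps per (54), starting from the current
        global model; the result is sent to the server only at the
        synchronization steps t in J_E."""
        v = w_global.copy()
        for _ in range(E):
            v = v - gamma * grad_fn(v)   # one SGD step on the local data
        return v

    def synchronize(local_models: list, selected: list) -> np.ndarray:
        # Aggregation (55) at a synchronization step: average the models
        # of the users in the selected set S.
        return np.mean([local_models[i] for i in selected], axis=0)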

