Adaptive and Blind Regression for Mobile Crowd...

15
IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing 3 Shan Chang , Member, IEEE, Chao Li, Hongzi Zhu , Member, IEEE, and Hang Chen 4 Abstract—In mobile crowd sensing (MCS) applications, a public model of a system is expected to be derived from observations 5 collected by mobile device users, through regression modeling. For example, a model describing the relationship between running 6 speed, heart rate, height, and weight of runner can be constructed using MCS data collected from wristbands. Unique features of MCS Q1 7 data bring regression new challenges. First, observations are error-prone and private, making it of great difficulty to derive an accurate 8 model without acquiring raw data. Second, observations are nonstationary and opportunistically, calling for an adaptive model updating 9 mechanism. Last, mobile devices are resource-constrained, posing an urgent demand for lightweight regression. We propose an Q2 10 adaptive and blind regression scheme. The core idea is first to select an optimal ‘safe’ subset of observations locally stored over all 11 participants, such that the inconsistency between the subset and the corresponding regression model is minimized, and as many 12 observations as possible are included. Then, based on the resulted regression model, more observations are checked and selected to 13 refine the model. With observations constantly coming, newly selected ‘safe’ observations are used to make the model updated 14 adaptively. To preserve data privacy, one-time pad masking and blocking scheme are integrated. 15 Index Terms—Mobile crowd sensing, blind regression, adaptive model updating, outlier, opportunistic sensing, non-stationary Ç 16 1 INTRODUCTION 17 T HE paradigm of mobile crowd sensing (MCS) empowers 18 ordinary people to contribute data sensed or generated 19 from their mobile devices, e.g., mobile phones, smart vehicles, 20 wearable devices, etc. Such sensory data, called observations, 21 can be aggregated and fused on a server for large-scale sens- 22 ing or community intelligence mining, for example, smart 23 home controlling [1], air pollution estimation [2], and biomed- 24 ical data based health condition forecasting [3], [4], etc. 25 One of the commonly used statistical learning methods is 26 regression, which can be used to estimate the relationship 27 between a dependent variable and multiple independent 28 variables based on collected MCS data. For example, a fuel- 29 saving navigation service relies on MCS data collected from 30 smart vehicles to construct the linear model between fuel 31 cost and routing decision. Or, a blood pressure predicting 32 service relies on MCS data collected from smart sphygmom- 33 anometers to construct a model between hour of sleep, age, 34 salt intake and blood pressure. However, MCS application 35 scenarios pose four rigid requirements to a practical regres- 36 sion estimator as follows. 1) Supreme reliability upon outliers: 37 untrained participants are enrolled in sensing task with 38 Commercial-Off-The-Shelf (COTS) mobile devices equipped 39 with low-end sensors, leading to untrustworthy low-quality 40 or outlier observations due to unintentional mistakes (e.g., 41 keystroke errors, misplaced decimal points, or wrong data 42 representation) or device limitations. To make it worse, it is 43 hard to know to what extend the sensory data are contami- 44 nated. The regression estimator should work reliably to 45 derive accurate models with low-quality data. 
2) Adaptive 46 mechanism: Observations, which are collected in an opportu- 47 nistic fashion when cheap wireless communication is avail- 48 able or when some specific conditions occur, are naturally 49 non-stationary, which means that both the distribution of 50 observations, and the possibility an outlier occurs might 51 change over time. The estimator should be able to deal with 52 ever-coming and ever-changing observations and take an 53 adaptive methodology to develop models over time. 3) 54 Strong privacy preservation: as observations are often 55 obtained through mobile devices, they are highly related 56 with the private and sensitive information of mobile device 57 users (e.g., current location, health status, etc.). The regres- 58 sion estimator should strongly preserve the privacy of MCS 59 participants by keeping raw observations locally stored and 60 processed. 4) Ultralow overheads: mobile devices are key to 61 MCS regression tasks, where they are not only involved in 62 sensing tasks but also take part in regression modeling, but 63 they are also resource constrained in terms of power, 64 computational and communication capabilities. The estima- 65 tor should take the limits of mobile devices into consider- 66 ation while conducting the regression modeling. 67 In the literature, regression problems with outliers and 68 with privacy-preservation are investigated separately. For 69 example, several secure regression methods [3], [5], [6] 70 have been proposed for mining distributed datasets or for 71 crowd-sourced systems, in which high-quality data are 72 assumed. In contrast, a number of schemes [7], [8] have been 73 proposed for outlier detection and diagnosis but without S. Chang, C. Li, and H. Chen are with the School of Computer Science & Technology, Donghua University, Shanghai 201620, P. R. China. E-mail: [email protected], {chaoli, chenhang}@mail.dhu.edu.cn, [email protected]. H. Zhu is with the Department of Computer Science and Technology, Shanghai Jiao Tong University, Shanghai 200000, P. R. China. Manuscript received 22 Jan. 2019; revised 9 June 2019; accepted 10 July 2019. Date of publication 0 . 0000; date of current version 0 . 0000. (Corresponding author: Shan Chang.) Recommended for acceptance by J. Huang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TMC.2019.2931341 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019 1 1536-1233 ß 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of Adaptive and Blind Regression for Mobile Crowd...

Page 1: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

1 Adaptive and Blind Regression for2 Mobile Crowd Sensing3 Shan Chang ,Member, IEEE, Chao Li, Hongzi Zhu ,Member, IEEE, and Hang Chen

4 Abstract—In mobile crowd sensing (MCS) applications, a public model of a system is expected to be derived from observations

5 collected by mobile device users, through regression modeling. For example, a model describing the relationship between running

6 speed, heart rate, height, and weight of runner can be constructed using MCS data collected from wristbands. Unique features of MCS

Q1 7 data bring regression new challenges. First, observations are error-prone and private, making it of great difficulty to derive an accurate

8 model without acquiring raw data. Second, observations are nonstationary and opportunistically, calling for an adaptive model updating

9 mechanism. Last, mobile devices are resource-constrained, posing an urgent demand for lightweight regression. We propose an

Q210 adaptive and blind regression scheme. The core idea is first to select an optimal ‘safe’ subset of observations locally stored over all

11 participants, such that the inconsistency between the subset and the corresponding regression model is minimized, and as many

12 observations as possible are included. Then, based on the resulted regression model, more observations are checked and selected to

13 refine the model. With observations constantly coming, newly selected ‘safe’ observations are used to make the model updated

14 adaptively. To preserve data privacy, one-time pad masking and blocking scheme are integrated.

15 Index Terms—Mobile crowd sensing, blind regression, adaptive model updating, outlier, opportunistic sensing, non-stationary

Ç

16 1 INTRODUCTION

17 THE paradigm of mobile crowd sensing (MCS) empowers18 ordinary people to contribute data sensed or generated19 from their mobile devices, e.g., mobile phones, smart vehicles,20 wearable devices, etc. Such sensory data, called observations,21 can be aggregated and fused on a server for large-scale sens-22 ing or community intelligence mining, for example, smart23 home controlling [1], air pollution estimation [2], and biomed-24 ical data based health condition forecasting [3], [4], etc.25 One of the commonly used statistical learning methods is26 regression, which can be used to estimate the relationship27 between a dependent variable and multiple independent28 variables based on collected MCS data. For example, a fuel-29 saving navigation service relies on MCS data collected from30 smart vehicles to construct the linear model between fuel31 cost and routing decision. Or, a blood pressure predicting32 service relies on MCS data collected from smart sphygmom-33 anometers to construct a model between hour of sleep, age,34 salt intake and blood pressure. However, MCS application35 scenarios pose four rigid requirements to a practical regres-36 sion estimator as follows. 1) Supreme reliability upon outliers:37 untrained participants are enrolled in sensing task with38 Commercial-Off-The-Shelf (COTS) mobile devices equipped

39with low-end sensors, leading to untrustworthy low-quality40or outlier observations due to unintentional mistakes (e.g.,41keystroke errors, misplaced decimal points, or wrong data42representation) or device limitations. To make it worse, it is43hard to know to what extend the sensory data are contami-44nated. The regression estimator should work reliably to45derive accurate models with low-quality data. 2) Adaptive46mechanism: Observations, which are collected in an opportu-47nistic fashion when cheap wireless communication is avail-48able or when some specific conditions occur, are naturally49non-stationary, which means that both the distribution of50observations, and the possibility an outlier occurs might51change over time. The estimator should be able to deal with52ever-coming and ever-changing observations and take an53adaptive methodology to develop models over time. 3)54Strong privacy preservation: as observations are often55obtained through mobile devices, they are highly related56with the private and sensitive information of mobile device57users (e.g., current location, health status, etc.). The regres-58sion estimator should strongly preserve the privacy of MCS59participants by keeping raw observations locally stored and60processed. 4) Ultralow overheads: mobile devices are key to61MCS regression tasks, where they are not only involved in62sensing tasks but also take part in regression modeling, but63they are also resource constrained in terms of power,64computational and communication capabilities. The estima-65tor should take the limits of mobile devices into consider-66ation while conducting the regression modeling.67In the literature, regression problems with outliers and68with privacy-preservation are investigated separately. For69example, several secure regression methods [3], [5], [6]70have been proposed for mining distributed datasets or for71crowd-sourced systems, in which high-quality data are72assumed. In contrast, a number of schemes [7], [8] have been73proposed for outlier detection and diagnosis but without

� S. Chang, C. Li, and H. Chen are with the School of Computer Science &Technology, Donghua University, Shanghai 201620, P. R. China.E-mail: [email protected], {chaoli, chenhang}@mail.dhu.edu.cn,[email protected].

� H. Zhu is with the Department of Computer Science and Technology,Shanghai Jiao Tong University, Shanghai 200000, P. R. China.

Manuscript received 22 Jan. 2019; revised 9 June 2019; accepted 10 July 2019.Date of publication 0 . 0000; date of current version 0 . 0000.(Corresponding author: Shan Chang.)Recommended for acceptance by J. Huang.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TMC.2019.2931341

IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019 1

1536-1233� 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See ht _tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof74 considering the data privacy issue. Recently, an outlier-toler-

75 ant blind regression scheme, PURE, is proposed [10]. How-76 ever, PURE fails to consider the opportunistic and non-77 stationary features of MCS data. For PURE, well-estimated78 models cannot be updated with new coming data. In order79 to keep up-to-date, newmodels have to be re-estimated. As a80 result, to the best of our knowledge, there exists no successful81 solution, to tackling regression tasks inMCS settings.82 In this paper, we consider typical MCS scenarios where83 sensory data are collected with mobile devices of unprofes-84 sional participants who can communicate with the MCS85 server via WiFi and 3G/4G. A regression estimator is pro-86 posed, called Lotus, that can be implemented as a set of proto-87 cols running between the server and mobile devices of88 participants. To deal with opportunistic and non-stationary89 observations, as illustrated in Fig. 1, Lotus estimates a model90 which can update adaptively with newly collected observa-91 tions periodically. More specifically, a two-step estimation is92 made for building an original regression model. First, we93 propose an optimal-safe-subset selection problem. By solv-94 ing the problem, an optimal ‘safe’ (not-likely-to-be-outlier)95 subset of observations distributed over all participants can96 be determine. Considering finding such an optimal-safe-sub-97 set is NP-hard, a distributed greedy hill-climbing algorithm98 is proposed ,where a Mahalanobis distance-based selection99 protocol is carried out to decide a primary ‘safe’ subset, by

100 which amodel is estimated as the starting point of hill-climb-101 ing, and then thismodel is optimized as observations outside102 the ‘safe’ subset, which support the model more strongly,103 being swapped in the ‘safe’ subset iteratively, until no obser-104 vations can be exchanged. We estimate a rough global105 regression model based on the final ‘safe’ subset. Second, we106 use this rough model to examine the validity (outlyingness)107 of local observations outside the optimal ‘safe’ subset. Those108 observations which are considered valid will be utilized to109 refine the global estimate. Moreover, the model updates110 once enough new observations are available. During model111 updating, a new ‘safe’ subset of fresh observations is first112 determined in the same way as adopted in estimating the113 original model to combine with the current model. With the114 combined model, which is considered as the new rough115 model, each participant re-examines the validity of new116 observations and old observations that have already been117 involved in the original model as well, for the purpose of118 updating the original model.119 We emphasize that Lotus will not require participants to120 share their raw observations with any others including the121 server because of the constraint of privacy-preservation,122 which poses a challenge to achieving reliable model esti-123 mates on contaminated observations (without acquiring the

124raw data, it is hard to identify and further remove outliers,125making the regression result unreliable). Lotus tackles this126challenge through three key strategies: 1) participants in127Lotus not only take part in collecting observations but also128collaborate with each other and with the server in making129decisions. 2) In Lotus, a one-time padmaskingmechanism is130introduced, such that secure aggregation can be achieved. 3)131a blocking scheme is proposed to enable blind optimization132of ‘safe’ subset. Another challenge is to determine which133observations are out-of-date and which are effective in134updating the model. Furthermore, how to eliminate the135impact of those out-of-date observations from the newmodel136and how to affect the newmodel with new effective observa-137tions in an incremental way are of great difficulty. In Lotus,138aggregation of local observations is designed to leverage the139additive property of multiplication of partitioned matrices.140As a result, new valid observations can be added in the new141refined model while out-of-date observations can be142removed from the newmodel in an incremental way.143The main advantages of Lotus are four-fold. First, Lotus is144resistant to high break-down outliers. Second, the confidenti-145ality of raw observations is strongly protected. Third, regres-146sion model can be incrementally updated and improved147over time. Last but not least, Lotus is a lightweight protocol148tailored for mobile devices. We conduct intensive analysis to149prove that Lotus can protect private data from being spied150and can defend against collusion attacks. We evaluate the151performance of Lotus through both trace-driven simulations152and real-world implementation. The results demonstrate the153effectiveness and robustness of Lotus to model updating and154in the presence of outliers even under a ratio of 40 percent.155The remainder of this paper is organized as follows. In156Section 2, we introduce problem formulation and prelimi-157naries. Section 3 describes the design of Lotus. In Section 4,158we presents the analysis of computational complexity and159the security. Section 5 shows the performance evaluation. In160Section 6, we elaborate the prototype implementation of161Lotus. We review related work in Section 7. Section 8 con-162cludes the paper.

1632 PROBLEM FORMULATION AND PRELIMINARIES

1642.1 Privacy and Threat Model165The main privacy leakage concerns during regression come166from inside adversaries which take part in the estimating167and updating procedures. More specifically, the MCS server168and participants in collaborative regression are included.169We characterize adversaries as follows.

170� Honest-but-curious: both MCS server and participants171are considered as ’passive’ adversaries which follow172the semi-honest model. It means that they execute the173pre-designed protocols honestly but are curious174about the private sensory data of others and attempt175to learn or infer such information asmuch as possible.176� Collusive: participants may also mutual collude (or177with the MCS server) to share knowledge in order to178reveal more private information. However, we179assume that the number of participants in collusion180is limited.

1812.2 Multivariate Linear Regression182In MCS, the basic multivariate linear regression problem183refers to a set of participants N ¼ fN1; N2; . . . ; Nmg and a

Fig. 1. Illustration of Lotus used in mobile crowd sensing applications.

2 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019

Page 3: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

184 sever S. Niði ¼ 1; 2; . . . ;mÞ collects a number of its own185 observations, each of them relates to p ðp > 1Þ independent186 variables x1; x2; . . . ; xp and a dependent variable y. For exam-187 ple, the rent of a house (dependent variable y) depends on six188 (p) factors, i.e., the neighborhood (x1), size (x2), number of189 rooms (x3), distance of nearest station (x4), distance of near-190 est shopping mall (x5), and attached facilities (x6). The jth

191 observation ooðiÞj of Ni is a vector of ½xðiÞ

j;1; xðiÞj;2; . . . ; x

ðiÞj;p; y

ðiÞj �: S

192 gathers observations from participants in N, to illuminate193 underlying association between variables, by fitting a model194 to observations.We have the definition as follows:

195 Definition 1. A multivariate linear regression model in196 MCS relates to observed independent and dependent variables,

197 i.e., xxðiÞ ¼ ½xxðiÞ1 ; xx

ðiÞ2 ; . . . ; xxðiÞ

ni�T and yyðiÞ ¼ ½yðiÞ1 ; y

ðiÞ2 ; . . . ; yðiÞni �

T

198 from Ni for i ¼ 1; 2; . . . ;m, where ni represents the number of

199 observations of Ni, and xxðiÞj ¼ ½1; xðiÞ

j;1; xðiÞj;2; . . . ; x

ðiÞj;p�, such that

Y ¼ XbY ¼ Xbþ ��; (1)201201

202 where XX ¼ ½xxð1Þ; xxð2Þ; . . . ; xxðmÞ�T , YY ¼ ½yyð1Þ; yyð2Þ; . . . ; yyðmÞ�T ,203 bb ¼ ½b0;b1; . . . ;bp�T is the coefficient vector of regression204 model, �� ¼ ½�1; �2; . . . ; �~�T ,(where ~ ¼Pm

i¼1 ni), represents205 random errors with zero expectation normal distribution.

206 A classic estimator is LS, which minimizes the sum of207 squared residuals, i.e., Residual Sum of Squares (RSS), and208 leads to the estimated value of unknown bb as

bbbb ¼ ðXXTXXÞ�1XXTYY

uu ¼Xmi¼1

ðxxðiÞÞTxxðiÞ; vv ¼Xmi¼1

ðxxðiÞÞT yyðiÞ

bbbb ¼ ðuÞ�1vðuÞ�1v:

(2)

210210

211

212 RSS is calculated as Rss ¼Pm

i¼1ðeeðiÞÞT eeðiÞ, where

213 eeðiÞ ¼ ½eðiÞ1 ; eðiÞ2 ; . . . ; eðiÞni �

T , and eðiÞj , computed by y

ðiÞj � xx

ðiÞjbbb, is

214 the residual of ooðiÞj to bbb.

215 Notice that both ðxxðiÞÞTxxðiÞ and ðxxðiÞÞT yyðiÞ can be com-216 puted by each Ni locally and submitted to S for calculating217 bb, without leaking the original xxðiÞ and yyðiÞ to S.218 Sharing aggregated results, however, should also be very219 careful, since abuse of aggregated results is vulnerable to220 observations recovery attacks.

221 2.3 Blind Regression with High Break-Down Outlier222 Resistance223 An observation is considered to be a regression outlier if it224 deviates from the relation followed by the majority of the225 data. We do not restrict the fraction of outliers in observa-226 tions from certain individual. However, given a set of obser-227 vations for regression modeling, the total outliers should be228 limited up to 50 percent; otherwise, it is impossible to229 achieve a reasonable regression model.

230 Property 1 [Blind]. A regression modeling is blind if the raw231 observations cannot be obtained or inferred by any others (the232 server) during the regression.

233 Property 2 [High Break-down Outlier Resistant]. A234 regression modeling is called high break-down outlier resistant235 method if the derived relation still fits the majority of data even if

236the portion of regression outliers reaches up to 50 percent of all237observations.

238Definition 2. The problem of blind regression with high239break-down outlier resistance is referred to as, given the pri-240vate observations, finding the optimal linear regression estimate241so that it satisfies both Property 1 and 2.

2422.4 Adaptive Regression with Non-Stationary243Observations244In MCS, a regression model is updated periodically or once245enough fresh observations are available. During model246updating, fresh observations are added into the current247model for improving model accuracy.

248Property 3 [Adaptive].Model updating is called adaptive if it249can accommodate to the gradual changes of dependency between250the predictor and criterion variables, resulting from the non-251stationarity of observations.

252Definition 3. The problem of adaptive regression with non-253stationary observations is referred to as, given a group of254new observations, updating a regression model (i.e., bb) without255re-estimating from scratch, satisfying Property 3.

256Remarks. Model updating should be also blind and high257break-down outlier resistant.

2582.5 Mahalanobis Distance259The Mahalanobis distance [13] of an observation oioi =260½xi;1; . . . ; xi;p; yi�, from a set of observations OO with mean261value mm ¼ ½m1; . . . ;mp;mpþ1� and covariance matrix V, is262defined as:

dMðooiÞ ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiðooi � mmÞTV �1ðooi � mmÞ

q; (3)

264264

265oi with greater dMðooiÞ from the rest of the observations is266said to have higher leverage, and is suspected as an outlier,267since it has a greater influence on the coefficients of the268regression equation.

2692.6 Design Goals and Challenges270We aim to develop a practical method to address the prob-271lems of Definitions 2 and 3, simultaneously, which is how-272ever very hard. The raw observations are not available due273to the data confidentiality consideration, which obstructs274outlier identification. To design a regression estimator satis-275fying privacy preservation and outlier resilience is nontriv-276ial. Furthermore, model updating should be adaptive to277non-stationary observations. The current dependency278between variables should be captured precisely. It implies279the necessity to withdraw outliers or outdated observations280from the new model. Without the prior knowledge on281observation distributions, it’s difficult to identify those ’bad’282observations. Unfortunately, privacy concerns make the283problem much harder.284Additionally, in MCS scenarios, typical mobile devices285have relatively weak computation abilities and limited286power. It is necessary that methodologies for adaptive287regression-estimating and privacy-preserving are light-288weight. Especially, we desire an incremental model updat-289ing scheme, to maximize the use of existing computational290results without re-estimating from scratch, and a privacy-291preserving scheme free of complicated encryption schemes

CHANG ETAL.: ADAPTIVE AND BLIND REGRESSION FOR MOBILE CROWD SENSING 3

Page 4: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

292 (e.g., homomorphic encryption), which induce high trans-293 mission and computation cost.

294 3 DESIGN OF LOTUS

295 3.1 Overview296 Lotus incorporates two techniques, i.e., blind regression297 model estimating and blind regression updating, as illus-298 trated in Fig. 2. Notice that, during regression, we develop a299 pair of lightweight secure calculation (aggregation and com-300 parison) schemes, which is intended to ensure the ‘blind’301 property.302 Blind Regression Model Estimating. This technique is used303 to derive an original regression model according to a set of304 observations, which has two steps. 1) Estimating rough model305 with optimal ‘safe’ subset: considering that outlier ratio is306 smaller than 50 percent, we wish to choose a ‘safe’ subset307 with half observations, which satisfies that the rough model308 estimated accordingly achieves minimum RSS, implying309 that observations in selected ‘safe’ subset obey quite the310 same trend. In other words, a ‘tight’ and ‘safe’ regression311 model can be made and used as a rough original model.312 However, finding such a subset is proven NP-hard. Thus313 we develop a two-stage hill-climbing approach to achieve314 an approximation of optimal solution. First, we select a pri-315 mary ’safe’ subset, to initializing rough model as the starting316 point of hill-climbing. The server collects and aggregates317 the information of some statistics (e.g., the mean) of local318 observations from all participants to get global statistics.319 Such global statistics are then distributed to all participants320 for data quality check. The first half observations are consid-321 ered as a primary ‘safe’ subset, and the corresponding322 regression model is intended as the starting point. Then, we323 introduce a swap algorithm to approach the optimal ‘safe’ subset324 for estimating the rough original model, where observations325 inside and outside of the ‘safe’ subset will be swapped itera-326 tively, such that the RSS of the corresponding model can be327 minimized. 2) Deriving refined model: by checking the data328 quality again using the rough model estimated, more valid329 local observations can be found and used to refine the rough330 model, ultimately resulting in the original model.331 Blind Regression Updating. This technique performs two332 functions to achieve an updated model estimate. 1)

333Calculating new rough model: when updating one model, a334new ‘safe’ subset is formed according to a fresh observation335group, which follows the same procedure as in finding the336initial ‘safe’ subset. Then a new rough model can be built by337including the new ‘safe’ subset into the current model. 2)338Deriving new refined model: after establishing the new rough339model securely, each participant checks the quality of fresh340observations, and non-outliers which have been used to341build the current model, based on the new rough model342locally, to decide which observations should be involved343into or cut out from the new model, respectively.344Secure Aggregation and Comparison. In the above stages, in345order to protect data privacy, each participant masks their346local aggregated results used for model estimating with par-347ticular random values, i.e., one-time pad masking. Since those348masks from different participants can cancel each other on349the server side, upon receiving the masked local aggrega-350tions, the server can estimate a model precisely. Moreover,351a blocking scheme is proposed, which enables that residuals352of observations can be compared in a secure manner, in the353procedure of optimizing ‘safe’ subset.

3543.2 Blind Regression Model Estimating

3553.2.1 Estimating Rough Global Model with Optimized

356‘Safe’ Subset

357In our design, we wish to use a subset with minimum risk of358including dirty observations as the ‘safe’ subset. Thus, we359give the following definition:

360Definition 4. The problem of finding optimal ‘safe’ subset is361referred to as, given a set O of n observations with p (constant)362independent and 1 dependent variables, i.e., an observation363oioi ¼ ðxixi; yiÞ; ði ¼ 1; . . . ; nÞ, where xixi is a p dimension vector,

364select n2

� �observations to fit a regression model, to minimize

365corresponding RSS.

366The reason for forming a size n2

� �‘safe’ subset is two-

367folds. On one side, if the subset has less than half observa-368tions, it is possible that all observations are dirty. On the369other side, if more than half observations are selected, the370risk of including outliers is increased. However, the prob-371lem of finding optimal ‘safe’ subset is very hard. We have372the following Theorems.

373Theorem 1. The problem of finding optimal ‘safe’ subset defined374in Definition 4 is NP-hard.

375Proof. Suppose that we apply LS estimator to all observa-376tions in set O, then the corresponding RSS, i.e., Rss ¼377

Pni¼1ðeeiÞT eei, can be minimized, where ei ¼ yi � xxi

bbb, and378bbb is the regression model estimated. We can reformulate

379Rss as

Rss ¼ Y TY � bbbTXTY;

381381

382whereX ¼ ½x1x1; . . . ; xnxn�T and Y ¼ ½y1; . . . ; yn�T .383Remember that bbb ¼ ðXTXÞ�1XTY , then if we select384

n2

� �observations from set O to fit an LS regression model

385�bb, it can be obtained by using

�bb ¼ ðXTWXÞ�1XTWY;

Fig. 2. Overview of lotus.

4 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019

Page 5: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

whereW is an n� n diagonal matrix , such that

wi;i ¼ 1 if observation oioi is selected0 otherwise:

�Meanwhile, we have the corresponding RSS, i.e.,

�Rss ¼ Y TWY � �bbTXTWY

¼ Y TWY � ½ðXTWXÞ�1XTWY �TXTWY

¼ Y TWY � Y TWTX½ðXTWXÞ�1�TXTWY:

Then the problem in Definition 4 can be expressed as thefollowing non-linear 0-1 programming problem

Minimize Y TWY � Y TWTX½ðXTWXÞ�1�TXTWY

Subject toXni¼1

wi;i ¼ n

2

l m; wi;i 2 f0; 1g:

386 It is well known that non-linear 0-1 programming prob-387 lem isNP-hard [14], [15], This completes the proof. tu388 Therefore, we use a greedy hill-climbing algorithm to389 approximate the optimum solution, where an iterative algo-390 rithm that starts with a Mahalanobis distance-based pri-391 mary ‘safe’ subset, then attempts to find a better ‘safe’392 subset by making an incremental change (by conducting393 observation swapping) to the subset.394 Selecting Primary ‘Safe’ Subset: In this step, we select a sub-395 set of observations, used as the starting point of hill climbing,396 in procedure of Approaching Optimal ‘Safe’ Subset. Hill397 climbing starts from an arbitrary solution to a problem, then398 tries to find a better solution by making an incremental399 change to the current one, and so on until no further400 improvements can be made. It is critical to find a good start-401 ing point (solution) close to the globalmaxima (best solution)402 as much as possible, in order to prevent the search stop at403 local maxima, and to reduce the number of searches as well.404 We aim to find a subset which is presumably free of outliers405 as the starting point, since a subset contaminated with out-406 liers may lead to a final solution composed entirely of out-407 liers. In order to detect outliers, it is necessary to take into408 consideration the shape of the data set. For example, given409 an elliptical shape cloud of 2-dimension data points, some410 points are closer to the center than others, however, we can-411 not conclude that those large distance points (in terms of the412 classical Euclidean distance) belong less to the cloud than413 short distance ones, as this is part of the underlying pattern414 of the data point distribution. Hence, instead of the Euclid-415 ean distance, we utilize Mahalanobis distance, which takes416 into account the shape of the observations.417 Suppose Ni 2 N holds a set of observations ooðiÞ ¼ fooðiÞ1 ;

418 ooðiÞ2 ; . . . ; ooðiÞn g, where oo

ðiÞj indicates the jth observation of Ni.

419 For the convenience of expression, we set the size of ooðiÞ as n420 (which is known by S). Actually, Lotus can be easily421 extended to fit a general case where the size of each ooðiÞ might422 not be equal. Observations from them participants are repre-

423 sented by OO ¼ S mi¼1oo

ðiÞ. We utilize m�n2

� �observations from

424 OOwith smallestMahalanobis distances, i.e., dM , to form a pri-

425 mary ‘safe’ subset. Figure 3 depicts the protocol between the

426 server and each participant.427 Taking into account privacy issues, each dM should be428 calculated by the corresponding observer locally, and the429 mean value mm ¼ ðm1; . . . ;mp;mpþ1Þ and covariance matrix VV

430of OO should be available to participants. To this end, the fol-431lowing protocol is performed between each Ni and the432server S to select the primary ‘safe’ subset.

433� Calculating mm: each Ni computes the sum of each434column in ooðiÞ locally, i.e., ssðiÞ ¼ ðPn

j¼1 xðiÞj;1; . . . ;

435Pn

j¼1 xðiÞj;p;Pn

j¼1 yðiÞj Þ, and then Ni sends ss

ðiÞ to S. After

436gathering ssðiÞ from all participants, it’s convenient437for S to calculate mm ¼ 1

m�n

Pmi¼1 ss

ðiÞ. Then S broad-438casts mm to N.439� Calculating VV �1: each Ni computes VV ðiÞ ¼Pn

j¼1

440ðooðiÞj � mmÞT ðooðiÞj � mmÞ (which is a ðpþ 1Þ � ðpþ 1Þ441symmetric matrix), and submits it to S. S computes442VV by using

VV ¼ EXmi¼1

VV ðiÞ !

¼ 1

m� n

Xmi¼1

VV ðiÞ;

444444

445then calculates and broadcasts VV �1, the inverse of VV ,446to N.447� Calculating dM : each Ni computes dMðooðiÞj Þ according448to (3).449� Estimating the global median of dM : each Ni sends the

450local median of dMðooðiÞÞ, i.e., MðiÞdM

, to S. S sorts all

451dMðooðiÞj Þ received, and broadcasts the median MdM452(which is the global median) to N.453� Preparing local primary ’safe’ subset: Ni selects those

454ooðiÞj whose dM are smaller than MdM to form a local

455‘safe’ set ooðciÞ ¼ fooðciÞ1 ; ooðciÞ2 ; . . . ; oo

ðciÞ�i

g. Notice that the

456union of local primary ’safe’ subset, i.e.,457oðcÞoðcÞ ¼ S m

i¼1ooðciÞ, is the global primary subset we

458needed.459According to the primary ’safe’ subset obtained, S coop-460erates with participants on initializing rough model without

461knowing ooðcÞ.

Fig. 3. Protocol of finding primary ‘safe’ subset.

CHANG ETAL.: ADAPTIVE AND BLIND REGRESSION FOR MOBILE CROWD SENSING 5

Page 6: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

462 More specifically, Ni computes ðxxðciÞÞT ðxxðciÞÞ, ðyyðciÞÞT463 ðyyðciÞÞ and ðxxðciÞÞT yyðciÞ using its local primary ‘safe’ subset,464 and submits the results to S. S computes uu

ðcÞ0 ¼Pm

i¼1

465 ðxxðciÞÞT ðxxðciÞÞ, vvðcÞ0 ¼Pmi¼1ðxxðciÞÞT yyðciÞ, and ww

ðcÞ0 ¼Pm

i¼1ðyyðciÞÞT466 ðyyðciÞÞ. Then, it’s convenient for S to estimate a regression467 model with coefficient vector bbb0 according to (2) (i.e.,468 bbb0 ¼ ðuuðcÞ

0 Þ�1vvðcÞ0 ). Furthermore, S calculates the correspond-

469 ing RSS, i.e., Rssð0Þ , which equals to wwðcÞ0 � ðbbb0ÞT vvðcÞ0 , and will

470 be used in the next stage.471 In the above procedure, Ni needs to provide a number of472 aggregated results, e.g., ðxxðciÞÞT ðxxðciÞÞ and ðxxðciÞÞT ðyyðciÞÞ,473 which closely relate to its raw observations ooðciÞ. However,474 disclosing those aggregates may cause privacy problem if475 the number of variables in ooðiÞc is no more than that of the476 equations built from them. Thus, sharing those aggregates477 directly is prohibited.478 Considering, according to the protocol, the server adds479 up each kind of aggregated values from all participants480 together for further processing, and doesn’t care about indi-481 vidual ones, we introduce a one-time pad masking technique482 for calculating the summation of aggregated values with-483 out disclosing individual ones, which will be explained in484 Section 3.4.1.

485 Remarks. The number of observations used for building a486 regression model should be no less than 2pþ 3, which487 prevents the primary ’safe’ subset oðcÞoðcÞ from being recov-488 ered by the server S (See Theorem 2 and the proof in489 Section 4.2.1).

490 Approaching Optimal ’Safe’ Subset: we use the primary491 ’safe’ subset oðcÞoðcÞ, leading to bbb0, as the starting point of hill492 climbing, and incrementally improve ’safe’ subset towards493 ’optimal’ (as defined in Definition 4). To this end, we check494 the residuals of remaining observations, and give the fol-495 lowing definition:

496 Definition 5. Given bbb estimated by using n observations, with497 corresponding RSS denoted as Rss, Moderate Residual498 Expectation (MRE) is referred to as

ffiffiffiffiffiffiffiffiffiffiffiffiffiRss=n

p.

499 Optimizing ’safe’ subset refers to adjusting members of500 subset such that a corresponding model with minimum RSS501 can be achieved. The MRE of an observation group implies502 the average residual induced by each observation in it. For503 the purpose of reducing the RSS, those observations outside504 the primary ’safe’ subset (named ’uncertain’) whose resid-505 uals are smaller than the MRE of current regression estima-506 tion will be included into the ’safe’ subset while the same507 number of ’safe’ observations with largest residuals will be508 removed from it. As a result, the primary ’safe’ subset is509 updated towards shrinking MRE as well as RSS, and a new510 regression model can be estimated according to it. In detail,511 the following observation-swapping protocol is performed512 between eachNi and the server S,

513 � Announcing current model and corresponding MRE: S514 calculates MRE of current model bbb0, denoted as515 Mreð0Þ, by using

Mreð0Þ ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiww

ðcÞ0 � ðbbb0ÞT vvðcÞ0

n

s;

517517

518 and broadcasts bbb0 andMreð0Þ to all participants.

519� Deciding local swap-in subset: each Ni calculates resid-520ual of each oo

ðiÞj to bbb0, i.e., e

ðiÞj ¼ y

ðiÞj � xx

ðiÞjbbb0, and com-

521pares those residuals with Mreð0Þ. Notice that ooðiÞ is522divided into local ‘safe’ subset ooðciÞ and remaining523subset ooðriÞ. Thus, those oo

ðriÞj whose residuals are

524smaller than Mreð0Þ are formed a swap-in subset525ooðiniÞ, and Ni submits the size of ooðiniÞ, i.e.,jooðiniÞj to S526by using one-time pad masking.527� Determining global swap-out threshold of residual: in528order to decide which observations should be529removed from ooðciÞ, a naive solution is to submit eeðciÞ

530to S. After receiving all eeðciÞ and jooðiniÞj, S computes531innum ¼Pm

i¼1 jooðiniÞj, the total number of observa-532tions should be added into ‘safe’ subset, and sorts all533residuals of ‘safe’ observations, and notifies Ni with534threshold of residuals etrsd such that innum observa-535tions whose residuals exceed etrsd should be536removed from the ‘safe’ subset.537� Updating local ’safe’ subset: on the side of Ni, accord-538ing to etrsd, a swap-out subset ooðoutiÞ can be formed,539which is composed of those observations satisfying540ee

ðciÞj > etrsd.

541� Revising model using updated ’safe’ subset: each Ni cal-

542culates ðxxðiniÞÞTxxðiniÞ � ðxxðoutiÞÞTxxðoutiÞ, ðyyðiniÞÞT yyðiniÞ�543ðyyðoutiÞÞT yyðoutiÞ and ðxxðiniÞÞT yyðiniÞ � ðxxðoutiÞÞT yyðoutiÞ, and544submits the results to S, by using one-time pad545masking. S computes uu

ðcÞ1 as follow,

uuðcÞ1 ¼ uu

ðcÞ0 þ

Xmi¼1

ððxxðiniÞÞTxxðiniÞ � ðxxðoutiÞÞTxxðoutiÞÞ

¼ uuðcÞ0 þ

Xmi¼1

ðxxðiniÞÞTxxðiniÞ �Xmi¼1

ðxxðoutiÞÞTxxðoutiÞ:547547

548

549Similarly, vvðcÞ1 and ww

ðcÞ1 are computed. Thus

550bbb1 ¼ ðuuðcÞ1 Þ�1vvðcÞ1 , and Rssð1Þ ¼ ww

ðcÞ1 � ðbbb1ÞT vvðcÞ1 .

551The above procedure will be carried out iteratively until552no ‘uncertain’ and ‘safe’ observations can be swapped in553and out of the ultimate (optimal) ’safe’ subset. The

Fig. 4. Protocol of approaching Q3optimal ‘safe’ subset.

6 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019

Page 7: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

554 regression estimate bbbsafe obtained in the last round is con-555 sidered as a rough global regression model.556 However, publishing residuals of certain observations to557 different models gives the chance for attackers to launch558 observation recovery attacks. One solution to solve the prob-559 lem is to conduct sorting on ciphertext, e.g., by using Garble560 Circuit, which however induces high computation and com-561 munication costs. We design a light-weight and effective562 blocking scheme by loosening the accuracy of ooðoutÞ, i.e.,563

S mi¼1oo

ðoutiÞ, slightly, and explain the scheme in Section 3.4.2.

564 3.2.2 Deriving Refined Global Model

565 After achieving the rough model bbbsafe (i.e., ðuuðcÞÞ�1vvðcÞ), S566 broadcasts bbbsafe and Root Mean Squared Error RMse

567 ðRMse ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Rssðn�m�p�1Þ

qÞ to all participants. Then the following

568 refining protocol is carried out between S and each Ni, in569 which the outlyingness of observations in its remaining sub-570 set ooðriÞ (i.e., ‘uncertain’) is tested in accordance with bbbsafe.571 Those observations fitting bbbsafe well will be considered as572 ’safe’, and be utilized for refining bbbsafe, which leads to an573 original estimate bbborg.574 Specifically, Ni calculates standardized residual z

ðriÞj of

575 ooðriÞj by using

zðriÞj ¼ jeðriÞj j=RMse: (4)

577577

578 Ni goes through ooðriÞ, and labels observations with zðriÞj

579 exceeding 1.69 according to [11], which implies inconsis-580 tency with bbbsafe, as outliers. Then, observations passing the581 test are formed as a refining subset ooðfiÞ, which will be582 added into the rough regression model. We emphasize that

583 ðxxðfiÞÞTxxðfiÞ and ðxxðfiÞÞT yyðfiÞ are computed locally, and are

submitted to S by using masking.584 S computes uuðorgÞ ¼ uuðcÞ þPm

i¼1ðxxðfiÞÞTxxðfiÞ and vvðorgÞ ¼585 vvðcÞ þPm

i¼1ðxxðfiÞÞT yyðfiÞ, S estimates bbborg ¼ ðuuðorgÞÞ�1vvðorgÞ.

586 3.3 Blind Regression Updating587 Under the current estimate bbborg, assume that there exists a

588 group of participants eN ¼ f eN1; eN2; . . . ; eNqg, and each

589 eNl ðl ¼ 1; 2; . . . ; qÞ holds a set of new observations, i.e.,

590 eoeoðlÞ ¼ feoeoðlÞ1 ; eoeoðlÞ2 ; . . . ; eoeoðlÞn g. It’s required that bbborg can update591 adaptively, according to the new observation set, i.e.,

592 eOeO ¼ S li¼1eoeoðlÞ. This requirement is two-fold: first, if new

593 observations show gradual change of relation between594 dependent and independent variables (resulting from non-595 stationarity of observations), the current model should596 update to accommodate the change accordingly; second, if597 not, the accuracy of the current model can be improved by598 using new observations despite outliers.599 However adaptive updating is nontrivial. First, it is very600 hard to distinguish outliers from those normal ones which601 impel the change of dependency between variables. Second,602 even the non-stationarity of observations can be verified,603 it’s very hard to tell if the change is temporary or long-term,604 thus we should neither completely accept nor ignore the605 new dependency. Furthermore, out-of-date observations in606 the current model which no longer accommodate to new607 dependency should be removed from new model, yet608 deciding those out-of-date observations is also not easy.609 To update bbborg adaptively by utilizing new observations,610 we apply a two step updating scheme, where the server S

611first calculates a new rough model ebbrgh, according to bbbrgh

612and the new ‘safe’ subset, and then refines ebbrgh to achieve613the new estimate. We explain our design in detail in the fol-614lowing subsections.

6153.3.1 Calculating New Rough Model

616In this procedure, a new ‘safe’ subset of eOeO is formed and617included into bbborg to construct the new rough model ebbrgh.618The advantage is that half new observations will be619included into the new rough model, such that the model620will reflect new dependency to some extent, while minimiz-621ing the risk of bringing outliers into the model.622Specifically, S cooperates with each eNi to select the new623optimal ‘safe’ subset (with a size of n

2

� �), following the pro-

624tocols which have been introduced in Section 3.2.1. We625denote the new optimal local ‘safe’ observation subset of eNl

626as eooðclÞ. Each eNl submits ðexexðclÞÞT exexðclÞ and ðexexðclÞÞTeyyðclÞwith627masking to S. After S obtaining the aggregates, i.e.,

628Pq

l¼1ðexexðclÞÞT exexðclÞ andPq

l¼1ðexexðclÞÞTeyyðclÞ, it estimates a new629rough model ebebrgh by using

euuðrghÞ ¼ uuðorgÞ þXql¼1

ðexxðclÞÞTexxðclÞ;

evvðrghÞ ¼ vvðorgÞ þXql¼1

ðexxðclÞÞTeyyðclÞ;ebbrgh ¼ ðeuuðrghÞÞ�1evvðrghÞ: 631631

632

6333.3.2 Deriving New Refined Model

634After updating the new rough model ebebrgh, S broadcasts it to635participants in U ¼ eN S N. Then old observations which

636have been included in ebebrgh will be rechecked, and those out-637of-date ones will be removed, while new observations which638haven’t been embedded in ebebrgh, i.e., ’uncertain’ ones, will be639tested such that non-outliers fitting ebebrgh well will be added640into themodel. In this way, a new refinedmodel is derived.641In detail, S communicates with each participant in U, to

642test the outlyingness of observations in ooðclÞS eoðrlÞeoðrlÞ, in accor-

643dance to ebebrgh, which is alike the procedure obtaining ooðfiÞ

644(described in Section 3.2.2). Those observations in ooðclÞ

645which unfit ebebrgh will be removed from the new model. On646the other side, observations in eoðrlÞeoðrlÞ which pass the test will647be added into the new model.648The advantage of the design is also two-fold. First, once an649observation is marked as out-of-date, it will be removed, and650never be brought back into the model again. It means that as651the model is updated, out-of-date observations will be652excluded from latest model gradually, and only those obser-653vations which always accord with the relation between vari-654ables can survive. Second, fresh observations can be655included into the newmodel asmuch as possible for improv-656ing model accuracy, at the same time, those fresh ones which657fail to fit into themodel (implying outliers) will be ignored.

6583.4 Secure Aggregation and Comparison

6593.4.1 One-Time Pad Masking

660Remember that in Section 3.2.1, the summation of certain661kind of local aggregated results (belonging to participants)662should be calculated by S without disclosing individual663ones. We propose a lightweight one-time pad masking

CHANG ETAL.: ADAPTIVE AND BLIND REGRESSION FOR MOBILE CROWD SENSING 7

Page 8: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof664 technique, and explain the key idea by calculating

Pmi¼1 ss

ðiÞ

665 (ssðiÞ is hold by Ni) as follows:666 The basic idea of masking is that each ssðiÞ is masked with667 certain secret value, such that all masks can be cancelled668 when they are added on the server side. Suppose each pair669 of participants ðNi;NkÞ agrees on certain random value bi;k.670 If Ni adds this to ssðiÞ, while Nk will subtract it from ssðkÞ. In671 the design, each participantNi computes:

Ai ¼ ssðiÞ þXi< k

bi;k �Xi > k

bk;i;

673673

674 and sends Ai to the server, and the server computes:

A ¼Xmi¼1

Ai ¼Xmi¼1

ðssðiÞ þXi < k

bi;k �Xi > k

bk;iÞ ¼Xmi¼1

ssðiÞ:

676676

677

678 We give a three person example in Fig. 5.N1 shares secret679 values b1;2 and b1;3 with N2 and N3, respectively, and N2 and680 N3 shares b2;3. Each person has a private number, e.g, N1

681 holds 10, and the server needs to calculated the summation682 of three private number, i.e., 10þ 20þ 30. To this end, each683 person masks its private number with all secret values shar-684 ing with others before submitting it to the server , e.g., N1

685 masks 10 with b1;2 and b1;2 (i.e., calculating 10þ b1;2 þ b1;3),686 and then reports 10þ b1;2 þ b1;3 to the server. Notice that N2

687 subtracts b1;2 from 20, sinceN1 adds it on 10. After collecting688 all masked values from participants, the server calcu-689 latesð10þ b1;2 þ b1;3Þ þ ð20� b1;2 þ b2;3Þ þ ð30� b1;3 � b2;3Þ ¼690 10þ 20þ 30.691 In order to avoid exchanging random values bi;k between692 participants, which requires quadratic communication over-693 head, each pair of participants ðNi;NkÞ shares a secret as the694 common seed in advance, such that the same pseudoran-695 dom can be generated by both parties. To this end, we696 assume that each Ni holds a pair of public and secret keys697 ðpki; skiÞ, which can be achieved by commonly used Public698 Key Infrastructure (PKI). The server maintains a public key699 list of participants. Once Ni registers with the server, pki700 will be added into the list, and the updated list will be pub-701 lished to all participants. Then, for each pkk in the list702 (except pki), Ni generates a secret si;k, encrypts it with pkk,703 and submits the ciphertext to Nk directly (or relay by the704 server). Nk decrypts the ciphertext using skk and gets si;k.705 After that, bi;k can be generated based on si;k. A simple way706 is to apply a secure hash function hð�Þ (which is known to707 all participants) on si;k, i.e., bi;k ¼ hðsi;kÞ. In consideration of

708security, bi;k will only be used once. For updating bi;k, new709mask b0i;k is calculated as hðbi;kÞ.710Remarks. In practice, it does not necessarily mask ssðiÞ

711using secrets shared with all others. Participants can be712divided into groups, such that each participant only shares713maskings with individuals belonging to the same group.

714Thus ssðiÞ is masked with secrets sharing with its group715members.

7163.4.2 Blocking

717In the third step of observation-swapping protocol, it is nec-718essary to identify innum observations with maximum resid-719uals in current global ’safe’ subset, forming a swap-out720subset. Thus S needs to sort all those observations according721to residuals, and decides a swap-out threshold of residual.722To this end, residuals of observations should be compared.723instead of collecting and comparing residuals directly,724which is vulnerable to observation recovery attacks, S runs725a blocking-based secure comparison protocol under the726cooperation of participants.727Specifically, S publishes a set of bins such that a residual728can fall into one and only one bin, furthermore, the sizes of729the bins are different, which satisfies that bigger residuals730will be thrown into smaller bins. EachNi has the knowledge731of those bins, hence can decide the number of observations732in current ’safe’ subset falling into each bin locally, and sub-733mits those numbers to S by using masking. Thus S can cal-734culate the total number of observations falling into each bin.735Since S knows the innum, which is equal to outnum, and fol-736lows the rule of small-bin-first, in which observations in737smaller bins well be removed first. S decides the borderline738bin binbd such that all observations in bins smaller and big-739ger than it will be removed and kept, respectively. While740partial observations in binbd will be removed. Then S broad-741casts the binbd and removed ratio of observations rmoratio in742it. According to this, Ni selects all observations in bins743smaller than binbd, and randomly selects observations in744binbd with a probability of rmoratio to form local swap-out745subset ooðoutiÞ.746We give an example in Fig. 6. Both Alice and Bob have747theirs local ‘safe’ subsets, the residuals of which are f1; 9g748and f8; 7; 2; 3g, respectively. Alice and Bob decide their local749swap-in subsets (i.e., observations with residuals 3 and 4,750selecting from their local ‘uncertain’ subsets, the residuals

Fig. 5. In this example of one-time pad masking, the server calculatesthe summation of 10, 20, and 30 held byN1, N2 andN3.

Fig. 6. In this example of blocking, observations with residuals 9 and 8 inthe ‘safe’ subset are replaced with observations with residuals 3 and 4 inthe ‘uncertain’ subset.

8 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 18, NO. X, XXXXX 2019

Page 9: Adaptive and Blind Regression for Mobile Crowd Sensinglion.sjtu.edu.cn/.../lionweb/data/publication/text/... · IEEE Proof 1 Adaptive and Blind Regression for 2 Mobile Crowd Sensing

IEEE P

roof

751 of which are f8; 9; 7; 3g and f4; 6g, respectively), according752 to the MRE announced by the server, and let the server753 know the total number of observations in swap-in subset,754 i.e., 2. Then the server publishes three bins of residuals, i.e.,755 ½0; 5Þ; ½5; 8�; ð8; 9�. Alice and Bob check their own ‘safe’ sub-756 sets, and repot the total numbers of observations falling in757 each bin, i.e., three, two and one observations fall into758 ½0; 5Þ; ½5; 8� and (8,9]. Since the server knows that two obser-759 vations should be swapped out from the global ‘safe’ subset,760 it decides that the borderline bin is [5,8], which means that761 the observation in bin (8,9] will be swapped out. Further-762 more, the server sets romratio ¼ 0:5, i.e., one swap-out obser-763 vation in bin [5,8] will be selected randomly. Consequently,764 Alice knows observation with residual 9 will be swapped765 out, and Bob chooses to swap the observation with resi-766 dual 8 out.

767 4 ANALYSIS

768 4.1 Computational Complexity on Mobile Devices769 In Lotus, the main computations are matrix multiplications,770 thus we use the number of multiplications and additions to771 represent the computational complexity. Since, the cost on772 initial model estimating and model evloving is exactly the773 same, we take the initial model estimating as an example to774 present the detailed analysis as follows:

4.1.1 Selecting Primary 'Safe' Subset

In order to pick a primary 'safe' subset from local observations, $N_i$ needs to calculate the covariance matrix $V^{(i)}$ and $d_M$, which takes $n(p+1)^2$ and $n(p+1)(p+2)$ multiplications, and $(n-1)(p+1)(p+3)$ and $np(p+2)$ additions, respectively. Then $N_i$ calculates $(x^{(c_i)})^T(x^{(c_i)})$ and $(x^{(c_i)})^T y^{(c_i)}$ based on its local primary 'safe' subset. In the case that all local observations of $N_i$ are considered 'safe', which means the set of all those observations, i.e., $o_i$, will be used to estimate a primary rough model acting as the starting point of the hill-climbing algorithm, $n(p+1)(p+2)$ multiplications and $(n-1)(p+1)(p+2)$ additions should be conducted for calculating $(x_j^{(i)})^T x_j^{(i)}$ and $(x_j^{(i)})^T y_j^{(i)}$ in total. In summary, the complexities of the multiplication and addition operations of selecting the primary 'safe' subset are $O(np^2)$ and $O(np)$, respectively.
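As a rough illustration of this step, the sketch below (our own code, assuming an intercept column and an externally chosen cutoff `d_threshold`, e.g., a chi-square quantile) screens the local observations by Mahalanobis distance and then forms the aggregates $(x^{(c_i)})^T x^{(c_i)}$ and $(x^{(c_i)})^T y^{(c_i)}$ that are submitted instead of raw data:

```python
import numpy as np

def local_safe_statistics(X, y, d_threshold):
    """Screen the n local observations by Mahalanobis distance d_M over
    the joint (x, y) samples, then aggregate X^T X and X^T y over the
    screened rows only; raw observations never leave the device."""
    Z = np.column_stack([X, y])                # n x (p+1) joint samples
    diff = Z - Z.mean(axis=0)
    V = np.cov(Z, rowvar=False)                # covariance matrix V^(i)
    d_M = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(V), diff))
    safe = d_M <= d_threshold                  # primary 'safe' subset mask
    Xc = np.hstack([np.ones((int(safe.sum()), 1)), X[safe]])  # add intercept
    return Xc.T @ Xc, Xc.T @ y[safe]
```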

4.1.2 Approaching Optimal 'Safe' Subset

Since $(x_j^{(i)})^T x_j^{(i)}$ and $(x_j^{(i)})^T y_j^{(i)}$ have been calculated in the last step, $N_i$ only performs $n(p+1)$ and $n$ multiplications, and $n(p+1)$ and $n-1$ additions, for calculating the residuals of all its local observations and $(y^{(c_i)})^T y^{(c_i)}$ in each iteration, respectively. In summary, the complexities of the multiplication and addition operations of approaching the optimal 'safe' subset are both $O(np)$.
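A minimal sketch of this per-iteration cost on a device, assuming the design matrix `X` already carries the intercept column as in the previous sketch:

```python
import numpy as np

def iteration_residuals(X, y, beta):
    """Residuals of all n local observations under the current model beta,
    plus the scalar (y^(ci))^T y^(ci): X @ beta costs n(p+1) multiplications
    and y @ y only n more, matching the per-iteration counts above."""
    r = y - X @ beta
    return r, float(y @ y)
```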

4.1.3 Deriving Refined Global Model

For checking the outlyingness of the local observations in the remaining subset of $N_i$, it calculates the residuals of those observations, i.e., $e^{(i)}$ and $z_j^{(i)}$, which takes $n(p+2)$ multiplications and $n$ additions in total. Consequently, the complexities of the multiplication and addition operations of deriving the refined global model are $O(np)$ and $O(n)$, respectively.

4.2 Security Analysis

4.2.1 Observation Recovery Attacks

After deciding a primary 'safe' subset $o^{(c)}$, $S$ estimates $\hat{\beta}_0$ based on it. Not surprisingly, $S$ can build a set of equations with the unknowns of $o^{(c)}$; thus we introduce the following theorem:

Theorem 2. In order to prevent a primary 'safe' subset $o^{(c)}$ (i.e., $\bigcup_{i=1}^{m} o^{(c_i)}$) from being recovered by $S$ or other attackers, the number of observations collected by all participants, i.e., the size of $O$, should be no less than $2p+3$.

Proof. An attacker aiming to recover $o^{(c)}$ treats the problem as solving a set of equations in the corresponding $\lceil m \times n / 2 \rceil \times (p+1)$ unknown variables. According to the protocol, $S$ obtains $\sum_{i=1}^{m}(x^{(c_i)})^T(x^{(c_i)})$, $\sum_{i=1}^{m}(y^{(c_i)})^T(y^{(c_i)})$ and $\sum_{i=1}^{m}(x^{(c_i)})^T y^{(c_i)}$, which yield $\frac{(p+1)(p+2)}{2}-1$, $1$ and $p+1$ equations related to $o^{(c)}$, respectively. It is essential that the number of variables be larger than the number of the corresponding equations; otherwise, $o^{(c)}$ could be recovered. Additionally, $o^{(c)}$ must have at least $p+1$ observations in order to estimate a regression. Thus, the following inequalities hold:

$$\lceil m \times n / 2 \rceil \times (p+1) > (p+1) + (p+1)(p+2)/2, \qquad (5)$$

$$\lceil m \times n / 2 \rceil \geq p+1. \qquad (6)$$

So we set the total number of observations $m \times n \geq 2p+3$, which meets the above conditions, and this concludes the proof. $\square$

In this step, $S$ obtains $\mu$ and $V$ of all $m \times n$ observations from all participants, from which $S$ can build $\frac{(p+1)(p+2)}{2}-1$ and $p+1$ equations related to $O$, respectively. Note that the size of $O$ is $m \times n \geq 2p+3$ (according to Theorem 2). Therefore, the number of variables is at least $(2p+3)(p+1)$. Since $(2p+3)(p+1) > \frac{(p+1)(p+2)}{2} + p$ always holds, $S$ cannot build enough equations to recover $O$.

Additionally, the original observations are protected from being recovered during approaching the optimal 'safe' subset and deriving the refined global model: since observations swapped into or out of the 'safe' subset can be viewed as a matrix added to, or subtracted from, an aggregate, it is impossible for $S$ to deduce any information from the differences between two received matrices.
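To make the counting argument concrete, a quick numeric check (our own verification code; `recovery_safe` is a hypothetical helper) confirms that $m \times n \geq 2p+3$ satisfies both (5) and (6):

```python
from math import ceil

def recovery_safe(total_obs, p):
    """True if the unknowns outnumber the equations from (5) and enough
    observations remain for a regression, as required by (6)."""
    unknowns = ceil(total_obs / 2) * (p + 1)
    equations = (p + 1) + (p + 1) * (p + 2) // 2   # (p+1)(p+2) is even
    return unknowns > equations and ceil(total_obs / 2) >= p + 1

# The bound m*n >= 2p+3 passes for all small dimensions we try.
assert all(recovery_safe(2 * p + 3, p) for p in range(1, 20))
```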

4.2.2 Collusion Attacks

Malicious participants may collude to recover the local aggregated values of $N_i$, which can be used to further deduce the related observations. In the example of calculating $\sum_{i=1}^{m} s^{(i)}$ in Section 3.4.1, $s^{(i)}$ of $N_i$ is masked with $m-1$ one-time pads shared with $m-1$ different participants. That is to say, in order to recover $s^{(i)}$, all of those $m-1$ participants must be malicious and collude with each other. Given $M$ colluding participants, there are two cases: first, if $M \geq m-1$, the probability that all participants sharing masks with $N_i$ collude is $P(M) = (M/q)^{m-1}$; as $M \ll q$, $P(M)$ is extremely small. Second, if $M < m-1$, it is impossible to recover $s^{(i)}$.
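As an illustration of the pairwise masking this argument rests on, here is a minimal sketch (our own code, with hypothetical helpers, an assumed 32-bit modulus, and our reading $P(M) = (M/q)^{m-1}$ of the probability above) of one-time-pad aggregation in the spirit of Fig. 5:

```python
import random

MOD = 2 ** 32   # assumed pad modulus

def pairwise_pads(ids):
    """One shared pad per participant pair; in practice each pair would
    derive it from a shared secret rather than a central draw."""
    return {(i, j): random.randrange(MOD) for i in ids for j in ids if i < j}

def mask(value, me, ids, pads):
    """Add pads shared with higher ids, subtract those shared with lower
    ids, so every pad cancels exactly once in the server-side sum."""
    for other in ids:
        if other != me:
            pad = pads[(min(me, other), max(me, other))]
            value += pad if me < other else -pad
    return value % MOD

ids, values = [1, 2, 3], {1: 10, 2: 20, 3: 30}          # the Fig. 5 example
pads = pairwise_pads(ids)
print(sum(mask(values[i], i, ids, pads) for i in ids) % MOD)   # 60

def collusion_prob(M, q, m):
    """All m-1 mask partners of N_i among the M colluders, partners
    assumed drawn uniformly from q participants."""
    return (M / q) ** (m - 1)

print(collusion_prob(M=10, q=10 ** 4, m=8))   # ~1e-21: negligible for M << q
```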

5 PERFORMANCE EVALUATION

5.1 Methodology

We examine the performance of Lotus on both real and synthetic datasets. We use three datasets, two of which are well-known and one generated, described as follows:

(1) Airfoil self-noise [16]: This dataset includes 1503 observations. We use the attributes of frequency, angle of attack, free-stream velocity and suction side displacement thickness to build the influence function of scaled sound pressure level.

(2) Concrete compressive strength [18]: The dataset contains 1030 observations. We use the attributes of cement, blast furnace slag, fly ash and age to build the influence function of the concrete compressive strength.

(3) Synthetic: We generate 1400 observations using $y = 5 + 5\sum_{i=1}^{9} x_i + \epsilon$, where $\epsilon \sim N(0,1)$ and $x_i \sim N(0,1)$.

Based on the above datasets, we generate two kinds of noise. All the noises are random vectors, and each dimension is independent and identically distributed.

(1) Random noise: variables obey a uniform distribution within the range $[0, v_{max} - v_{min}]$, where $v_{max}$ and $v_{min}$ refer to the maximum and minimum measures of a given attribute in a given dataset, respectively.

(2) Normally distributed noise $N(\mu, \sigma^2)$: variables obey a normal distribution with mean $\mu$ and standard deviation $\sigma$, where $\mu$ and $\sigma$ are the estimates of the mean and standard deviation of a given attribute of the dataset.

For each dataset, we generate outliers by adding a certain noise to original observations at a predetermined ratio. We compare the performance of Lotus with the state-of-the-art LS estimator, and a WLS estimator in which observations with larger residuals receive smaller weights in order to reduce the influence of suspected outliers in modeling. WLS leads to the estimate of $\beta$ as

$$\hat{\beta}_W = (X^T W X)^{-1} X^T W Y.$$

The weight matrix $W$ is diagonal, where the $i$th diagonal element reflects the influence of the $i$th observation. Specifically, $W$ is initialized as an identity matrix, is updated according to the current $\hat{\beta}_W$, and is then used for the next model estimate.

By utilizing the LS estimator on the original datasets, we obtain high-quality models that serve as the ground truths. We evaluate the accuracy $acc$ of an estimate by calculating the relative difference of model coefficients between an estimate $\hat{\beta}$ and the ground truth $\beta$, i.e.,

$$acc = \frac{\|\beta - \hat{\beta}\|}{\|\beta\|}.$$

For each dataset, noise type and simulation configuration, we run the simulation 20 times and take the average.
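To make the methodology concrete, the sketch below (our own code; the inverse-residual weighting rule and all helper names are assumptions, since only the general reweighting scheme is stated above) regenerates the synthetic dataset, injects random-noise outliers, and compares plain LS with the WLS baseline under the accuracy metric above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 5 + 5 * sum_{i=1}^{9} x_i + eps, eps, x_i ~ N(0,1).
n, p = 1400, 9
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # intercept column
y = X @ np.full(p + 1, 5.0) + rng.normal(size=n)
beta_gt = np.linalg.solve(X.T @ X, X.T @ y)   # ground truth: LS on clean data

# Inject 20 percent random-noise outliers, uniform on [0, v_max - v_min].
idx = rng.choice(n, size=int(0.20 * n), replace=False)
y_noisy = y.copy()
y_noisy[idx] += rng.uniform(0.0, y.max() - y.min(), size=idx.size)

def wls(X, y, iters=10, tol=1e-6):
    """WLS baseline: W starts as the identity, then observations with
    larger residuals receive smaller weights and the model is refit."""
    w = np.ones(len(y))
    for _ in range(iters):
        Xw = X * w[:, None]                     # X^T W X without forming W
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        w = 1.0 / (np.abs(y - X @ beta) + tol)  # larger residual, less weight
    return beta

def acc(beta, beta_hat):
    """Relative coefficient difference: lower values mean better estimates."""
    return np.linalg.norm(beta - beta_hat) / np.linalg.norm(beta)

print(acc(beta_gt, np.linalg.solve(X.T @ X, X.T @ y_noisy)),   # plain LS
      acc(beta_gt, wls(X, y_noisy)))
```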

5.2 Number of Iterations Needed for Rough Global Model Estimating

As a larger number of iterations means more interactions between participants and the server, resulting in a larger communication cost in modeling and a larger privacy risk, we examine how many iterations are needed to optimize a 'safe' subset. We use the above datasets under the normally distributed and random noise settings. We vary the outlier ratio from 5 to 50 percent at an interval of 5 percent, run the experiments 20 times, and take the average number of iterations. We make the following observations from the experimental results shown in Fig. 7. First, for the two real datasets, the average number of iterations in all noise settings is no more than 2 (except for the Airfoil dataset with 5 percent outliers). Second, for the synthetic dataset, the maximum average number of iterations is 3.4, and the typical average is around 3. Third, compared with real data, synthetic data need more iterations. We speculate that this is because the linear relationship between synthetic observations is stronger than that of real data, which implies more borderline observations. Fourth, the number of iterations decreases slightly as the outlier ratio increases. We find that the residual differences between normal data are smaller than those between normal data and outliers; thus more effort is needed to optimize the 'safe' subset. Finally, the number of iterations shows no obvious difference under different types of noise, especially for the real datasets.

5.3 Accuracy with Model Updating

In this experiment, we examine the performance of Lotus on the three datasets as more observations facilitate model updating. In order to simulate periodical regression model updating in MCS, for each dataset we randomly divide the observations into groups, each of which satisfies that at least $n \geq 50 + 8p$ observations are included, such that the number of observations for estimating a regression model is sufficient according to the suggestion from Green [19]. For Airfoil self-noise, Concrete compressive strength and

Fig. 7. The number of iterations needed for optimizing the 'safe' subset versus outlier ratio.


Synthetic, the numbers of groups are 15, 6 and 10, respectively. We then randomly pick an unused group from a given dataset for model estimating or updating. We compare Lotus with LS and WLS, under both the random and normally distributed noise settings and three typical outlier ratios, i.e., 5, 20 and 40 percent, implying high-, medium- and low-quality data, and we illustrate the experimental results in Figs. 8a, 8b, and 8c, respectively.

Fig. 8 plots the accuracy of the estimated models, where the ground truths are the global optimal models, estimated by LS from all original observation groups (without noise) obtained from the beginning to the current updating period. The horizontal axis indicates the number of updates, where zero refers to the initial model without updating. We can see that the accuracy of the model estimated by Lotus continuously improves as new groups of observations are included in model updating. It can also be seen that Lotus outperforms LS and WLS in all settings of the three datasets. Although in a few settings, e.g., the Airfoil data with 40 percent outliers, WLS achieves performance similar to Lotus in the early rounds of updating, Lotus refines the model much faster than WLS in the following rounds and achieves a better estimate in the end. Moreover, as the model is updated, the performance gaps between Lotus and the other two estimators become even larger.

5.4 Robustness under Different Outlier Ratios

In this experiment, we further examine the robustness of Lotus against outliers under different outlier ratios $\varepsilon$ and of different types. Specifically, we check both the initial estimates, which are established from the first groups of observations without any updating, and the final models, which are updated several times based on all observations in the corresponding datasets. For each noise setting, the outlier ratio $\varepsilon$ varies from 5 to 50 percent at an interval of 5 percent. We randomly divide the observations into groups as described before; for a group of $n$ observations, we generate $n\varepsilon$ outliers according to the given noise type and randomly distribute them to observations in the group.

Fig. 8. Model accuracy versus the number of updates under different outlier ratios.


Figs. 9 and 10 plot the accuracy of the estimated models (the ground truths are the corresponding global optimal models) as a function of outlier ratio under the normal and random noise settings, respectively. It can be seen that Lotus outperforms LS and WLS in all settings. LS is very sensitive to outliers: even at a low outlier ratio of 5 percent, LS performs badly, whereas Lotus holds good accuracy even when $\varepsilon$ increases to 40 percent, especially for the final estimates. In some cases, particularly on the synthetic dataset, WLS can achieve high accuracy of the final models at low outlier ratios; however, its performance degrades dramatically as the outlier ratio increases to 10 percent. In contrast, Lotus is much more tolerant to outliers, and model updating significantly improves the accuracy of the estimates. Overall, we find that Lotus is robust to not only the number but also the randomness of outliers.

6 PROTOTYPE IMPLEMENTATION

We implement Lotus on Google Nexus 5X smartphones (each equipped with a hexa-core 1.8 GHz CPU and 2 GB memory, running Android 8.0), Huawei Honor 5X (KIW-UL00) smartphones (octa-core 1.5 GHz CPU and 2 GB memory, running EMUI 4.0.3, based on Android 6.0), and a server (with an Intel i7 3.5 GHz six-core CPU and 32 GB RAM, running Windows 10). We use the smartphones as different participants in a participatory sensing task, and set the server to collect sensory data from participants and to coordinate the regression modeling procedure accordingly. Data communication between participants and the server is based on the WLAN in our laboratory. With this prototype, we conduct experiments based on the Energy efficiency dataset [17], which consists of 763 observations. We use the attributes of relative compactness, orientation, roof area, overall height, glazing area, and glazing area distribution to build the influence function of heating load ($p$ is 6).

6.1 Performance on Model Accuracy

In this experiment, we randomly divide all observations into 7 groups, each of which contains 109 observations, satisfying that at least $n \geq 50 + 8p$ observations are included in each group according to [19], and randomly pick an unused group for estimating the original model or updating the model. Then, we further divide each observation group into 8 subgroups (each containing 13 or 14 observations) and distribute them to all participants randomly. For each observation group, outliers are generated by adding a certain type of noise to observations. Specifically, we apply three types of noise, i.e., the random noise and $N(\mu, \sigma^2)$ noise described in Section 5.1, and standard normally distributed noise $N(0, 1)$. For each type of noise, we vary the outlier ratio $\varepsilon$ from 5 to 50 percent at an interval of 5 percent. Under each noise setting (outlier ratio $\varepsilon$ and type of noise), the server generates $109 \times \varepsilon$ tokens and randomly distributes them to all participants (e.g., for $\varepsilon = 10\%$, 11 tokens are generated). For each token, a participant randomly picks one observation and produces an outlier observation by superposing the noise (generated according to the given noise type) on the observation. For each noise setting and outlier ratio, the server performs regression modeling adopting Lotus, LS and WLS, respectively. We evaluate model accuracy by conducting each experiment 20 times and calculating the average.
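The token mechanism can be sketched as follows (hypothetical helper names; the uniform distribution of tokens over participants is our assumption). For a 109-observation group with $\varepsilon = 10\%$, `distribute_tokens(11, participants)` matches the setting above:

```python
import random

def distribute_tokens(num_tokens, participants):
    """Server: hand out outlier tokens to participants uniformly at random."""
    counts = {pid: 0 for pid in participants}
    for _ in range(num_tokens):
        counts[random.choice(participants)] += 1
    return counts

def corrupt_local(observations, tokens, noise_fn):
    """Participant: for each held token, pick a distinct local observation
    and superpose noise on it to produce an outlier."""
    obs = list(observations)
    for i in random.sample(range(len(obs)), k=min(tokens, len(obs))):
        obs[i] = obs[i] + noise_fn()
    return obs
```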

Fig. 9. Model accuracy versus outlier ratio (under the $N(\mu, \sigma^2)$ normal noise setting).

Fig. 10. Model accuracy versus outlier ratio (under random noise setting).

In order to examine the performance of Lotus with model updating, we calculate the average relative differences of model coefficients between the estimated models and the global optimal models (estimated in the same way as described in Section 5.3). Figs. 11a, 11b, and 11c plot the average relative differences as a function of the number of updates under 10, 20 and 30 percent outliers, respectively. It can be seen that the accuracy of the models estimated by LS and WLS remains unchanged despite updating. In contrast, the accuracy of Lotus improves with updating in most cases. Especially for random noise outliers, model updating helps improve model accuracy significantly under all three outlier ratios. With $N(\mu, \sigma^2)$ distributed noise, updating gains less improvement in accuracy as the outlier ratio increases, because $N(\mu, \sigma^2)$ noise leads to larger observation errors compared with the other two types of noise.

We also verify the robustness of Lotus against outliers under different outlier ratios and noise types. Fig. 12 plots the average model accuracy as a function of outlier ratio under different noise settings. Both the initial estimates and the final models are illustrated. According to Fig. 12, we draw the following conclusions. First, Lotus achieves excellent accuracy when the percentage of standard normal and random noise is less than 20 percent, and when the percentage of $N(\mu, \sigma^2)$ distributed noise is less than 10 percent. Second, Lotus outperforms LS and WLS in all settings, especially when the percentage of outliers is less than 40 percent. For instance, when the percentage of outliers equals 10 percent, the final estimate accuracy of Lotus can be one order of magnitude higher than that of LS and WLS. Third, the accuracy of Lotus decreases the most slowly as the outlier ratio increases. Overall, for Lotus, a final estimate is more accurate than its corresponding initial estimate under low and moderate outlier ratios. The accuracy difference between final and initial models shrinks at high outlier ratios, because too many outliers make the choice of the optimal 'safe' subset during updating much harder, which impedes the refinement of the estimate.

Additionally, we measure the iterations needed for approaching the optimal 'safe' subset under the three types of noise. We vary the outlier ratio from 5 to 50 percent at an interval of 5 percent, run the experiments 20 times, and take the average number of iterations. The experimental results are depicted in Fig. 13, which demonstrates that the average number of iterations is no more than 3 under any type of noise. Furthermore, we can conclude that the average number of iterations is also robust to the percentage of outliers. Finally, we find that random noise outliers imply more iterations. We consider that LS-based estimators assume that gross errors follow a normal distribution; thus random noise leads to a larger deviation of the initial 'safe' subset from the optimal one, and consequently more iterations are needed.

6.2 Computational Complexity

In this experiment, we aim to measure the time consumption of executing Lotus on mobile devices. We again use the observation groups of Energy efficiency, each of which contains 109 observations. We vary the number of participants involved in regression, denoted as $m$, from 3 (the minimum number of participants that must be included in one-time pad masking based aggregation) to 15 (we assume that at least $p+1$, i.e., 7, observations should be collected by each participant),

Fig. 11. Model accuracy versus the number of updates under different outlier ratios.

Fig. 12. Model accuracy versus outlier ratio under different noise settings.


and calculate the number of observations obtained by each participant, i.e., $n = 109/m$. We measure the time consumption of the steps of selecting primary 'safe' subset, approaching optimal 'safe' subset, and deriving refined global model under different numbers of observations. Figs. 14a and 14b are boxplots of the time consumption on the Huawei Honor and Google Nexus smartphones, respectively. It can be seen that the time consumption increases linearly with the number of observations $n$ in all steps (note that we skip several impossible values of $n$, e.g., 14, 16, 17, 28-35, etc.). Surprisingly, we find that, although the hardware configurations of the Huawei Honor 5X and Google Nexus 5X smartphones are comparable, the time consumption on the Nexus is one order of magnitude lower than that on the Huawei. We speculate that this may be because the Nexus runs a higher version of Android.

7 RELATED WORK

Privacy-preserving data aggregation has been widely investigated in the literature. Most privacy-preserving aggregation methods, however, only focus on calculating simple aggregation functions such as sum, mean and max/min, which cannot be used directly for privacy-preserving regression estimation. Privacy-preserving outlier detection has been investigated in distributed systems [9]. Most methods deal with distance-based outliers; however, the definition of outliers in regression is much more complex. Perturbation technologies are widely used for privacy protection, where random noise is used to cover up private data. PoolView [20] protects the privacy of stream data in participatory sensing. The core idea is to add random noise with a known distribution to the raw data, after which a reconstruction algorithm is used to estimate the distribution of the original data. The disadvantage of perturbation-based schemes is that the introduced noise makes regression estimates inaccurate.

Some researchers focus on robust regression without considering privacy issues. For example, M. Jagielski et al. investigate poisoning attacks on regression learning and propose a defense approach, TRIM, which iteratively estimates the regression parameters based on a subset of points with the lowest residuals in each iteration [21]. S. Lathuiliere et al. derive an optimization algorithm, DeepGUM, which can adapt to updates of the outlier distribution. DeepGUM requires supervised training with clean samples [22]; however, distinguishing clean samples from outliers is non-trivial. Studies on constructing an accurate regression model while preserving privacy are more relevant to this work. M-PERM is a mutual privacy-preserving regression modeling approach, where a series of data transformation and aggregation operations are performed at the participatory nodes to preserve data privacy [5]. However, it should be emphasized that these methods are all based on the LS estimator. Their correctness relies on the assumption that the original data are collected correctly without gross errors, i.e., outliers. Any incorrect observations may break down the estimation, since the LS estimator is very sensitive to outliers. PURE, an outlier-tolerant privacy-preserving regression scheme proposed by S. Chang et al., is most relevant to our work [10]. However, PURE does not consider the opportunistic and non-stationary features of sensory data, and cannot deal with model updating. Furthermore, in PURE, model estimation requires a number of iterations, which is undesirable on energy-constrained mobile devices due to high communication and computation costs.

8 CONCLUSION

In this paper, we have introduced Lotus, a scheme for regression and model updating with low-quality, private, opportunistic and non-stationary sensory data in MCS. Lotus provides strong protection of the confidentiality of raw sensory data by masking aggregated results with cancellable one-time pads and by blindly optimizing the 'safe' subset with a blocking scheme, and it achieves good model accuracy with very low-quality and non-stationary data by identifying

Fig. 14. Computation time on smartphones versus the number of observations.

Fig. 13. Number of iterations needed versus outlier ratio under different noise settings.


suspected outliers and withdrawing outdated observations during model updating. Both analysis and extensive simulation results demonstrate the efficacy of Lotus. We have also built a prototype MCS system in our lab and further examined the feasibility of Lotus under real deployment.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (Grants No. 61672151, 61772340, 61420106010), the Shanghai Rising-Star Program (Grant No. 17QA1400100), the National Key R&D Program of China (Grant No. 2018YFC1900700), the Shanghai Municipal Natural Science Foundation (Grant No. 18ZR1401200), the Fundamental Research Funds for the Central Universities (Grant No. EG2018028), the DHU Distinguished Young Professor Program, and the 2017 CCF-IFAA Research Fund.

REFERENCES

[1] J. Vanus, P. Valicek, T. Novak, R. Martinek, P. Bilik, and J. Zidek, "Utilization of regression analysis to increase the control accuracy of dimmer lighting systems in the smart home," IFAC-PapersOnLine, vol. 49, pp. 517-522, 2016.
[2] H. J. Lee, R. B. Chatfield, and A. W. Strawa, "Enhancing the applicability of satellite remote sensing for PM2.5 estimation using MODIS deep blue AOD and land use regression in California, United States," Environmental Sci. Technol., vol. 50, pp. 6546-6555, 2016.
[3] H. Kikuchi, C. Hamanaga, H. Yasunaga, H. Matsui, and H. Hashimoto, "Privacy-preserving multiple linear regression of vertically partitioned real medical datasets," in Proc. IEEE 31st Int. Conf. Advanced Inf. Netw. Appl., 2017, pp. 1042-1049.
[4] Y. Gong, Y. Fang, and Y. Guo, "Private data analytics on biomedical sensing data via distributed computation," IEEE/ACM Trans. Comput. Biology Bioinf., vol. 13, no. 3, pp. 431-444, May-Jun. 2015.
[5] K. Xing, Z. Wan, P. Hu, H. Zhu, Y. Wang, X. Chen, Y. Wang, and L. Huang, "Mutual privacy-preserving regression modeling in participatory sensing," in Proc. IEEE INFOCOM, 2013, pp. 3039-3047.
[6] A. F. Karr, X. Lin, J. Reiter, and A. P. Sanil, "Secure regression on distributed databases," Comput. Graphical Statist., vol. 14, pp. 263-279, 2004.
[7] Y. Zhang, N. Meratnia, and P. J. M. Havinga, "Distributed online outlier detection in wireless sensor networks using ellipsoidal support vector machine," Ad Hoc Netw., vol. 11, pp. 1062-1074, 2013.
[8] A. D. Paola, S. Gaglio, G. L. Re, F. Milazzo, and M. Ortolani, "Adaptive distributed outlier detection for WSNs," IEEE Trans. Cybernetics, vol. 45, no. 5, pp. 902-913, May 2015.
[9] J. Vaidya and C. Clifton, "Privacy-preserving outlier detection," in Proc. IEEE 9th Int. Conf. Young Comput. Sci., 2004, pp. 233-240.
[10] S. Chang, H. Zhu, W. Zhang, L. Lu, and Y. Zhu, "PURE: Blind regression modeling for low quality data with participatory sensing," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 4, pp. 1199-1211, Apr. 2016.
[11] M. Kutner, C. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models. New York, NY, USA: McGraw-Hill, 2005.
[12] D. Ruppert and M. P. Wand, "Multivariate locally weighted least squares regression," Ann. Statist., vol. 22, pp. 1346-1370, 1994.
[13] P. Mahalanobis, "On the generalised distance in statistics," Proc. Nat. Inst. Sci., vol. 2, pp. 49-55, 1936.
[14] R. Hemmecke, M. Köppe, J. Lee, and R. Weismantel, "Nonlinear integer programming," in 50 Years of Integer Programming 1958-2008. Berlin, Germany: Springer, 2009, pp. 561-618.
[15] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.
[16] Airfoil self-noise dataset. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise
[17] Energy efficiency dataset. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Energy+efficiency
[18] Concrete compressive strength dataset. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
[19] S. Green, "How many subjects does it take to do a regression analysis?," Multivariate Behavioral Res., vol. 26, pp. 499-510, 1991.
[20] R. Ganti, N. Pham, Y. Tsai, and T. Abdelzaher, "PoolView: Stream privacy for grassroots participatory sensing," in Proc. 6th ACM Conf. Embedded Netw. Sensor Syst., 2008, pp. 281-294.
[21] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li, "Manipulating machine learning: Poisoning attacks and countermeasures for regression learning," in Proc. IEEE Symp. Security Privacy, 2018, pp. 19-35.
[22] S. Lathuiliere, P. Mesejo, X. Alameda-Pineda, and R. Horaud, "DeepGUM: Learning deep robust regression with a Gaussian-uniform mixture model," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 202-217.

Shan Chang received the PhD degree in computer software and theory from Xi'an Jiaotong University in 2013. From 2009 to 2010, she was a visiting scholar with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. She was also a visiting scholar with the BBCR Research Lab, University of Waterloo, from 2010 to 2011. She is now an associate professor with the Department of Computer Science and Technology, Donghua University, Shanghai. Her research interests include security and privacy in mobile networks and sensor networks. She is a member of the IEEE.

Chao Li received the BS degree in computer science and technology from Anhui University in 2016. He is currently working toward the postgraduate degree in the Department of Computer Science and Technology, Donghua University, Shanghai. His research interests include privacy and distributed system security. He is a member of the ACM.

Hongzi Zhu received the PhD degree in computer science from Shanghai Jiao Tong University in 2009. He was a post-doctoral fellow with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, and the Department of Electrical and Computer Engineering, University of Waterloo, in 2009 and 2010, respectively. He is now an associate professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests include vehicular networks, mobile sensing, and computing. He received the Best Paper Award from IEEE Globecom 2016. He was a leading guest editor for IEEE Network Magazine. He is an associate editor for the IEEE Transactions on Vehicular Technology. He is a member of the IEEE, IEEE Computer Society, IEEE Communication Society, and IEEE Vehicular Technology Society. For more information, please visit http://lion.sjtu.edu.cn.

Hang Chen received the BS degree from the Shanghai University of Engineering and Science in 2017. He is currently working toward the postgraduate degree in the Department of Computer Science and Technology, Donghua University, Shanghai. His research interests include Internet of Things, mobile sensing, and wearable computing. He is a member of the ACM.
