Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf ·...

10
Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy, and Liviu Iftode Department of Computer Science, Rutgers University {spravin,vinodg,iftode}@cs.rutgers.edu ABSTRACT To usefully query a location-based service, a mobile device must typically present its own location in its query to the server. This may not be acceptable to clients that wish to protect the privacy of their location. This paper presents the design and implementation of SybilQuery, a fully decentralized and autonomous k-anonymity-based scheme to privately query location-based services. SybilQuery is a client-side tool that generates k - 1 Sybil queries for each query by the client. The location-based server is presented with a set of k queries and is unable to distinguish between the client’s query and the Sybil queries, thereby achieving k-anonymity. We tested our implementation of SybilQuery on real mobility traces of approximately 500 cabs in the San Francisco Bay area. Our experiments show that SybilQuery can efficiently generate Sybil queries and that these queries are indistinguishable from real queries. Author Keywords Location-based Services, Vehicular Computing, Privacy, Anonymity. ACM Classification Keywords K.6.5 Management of Computing and Information Systems: Security and Protection General Terms Design, Experimentation. INTRODUCTION Modern mobile devices contain global positioning systems (GPS) that let these devices precisely determine their location. Coupled with Internet connectivity via 3G and WiFi, GPS helps transform these mobile devices to truly ubiquitous devices that can make location-based queries. A querying client device sends its location to a server, which responds with answers specific to the client’s location. Examples of such queries include real-time traffic queries, e.g.,“how congested is the route from my home to my office?” and points-of-interest queries, e.g.,“what are the Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Ubicomp 2009, Sep 30 – Oct 3, 2009, Orlando, Florida, USA. Copyright 2009 ACM 978-1-60558-431-7/09/09...$10.00 restaurants close to my current location?” Several modern devices, such as the iPhone, are pre-installed with query interfaces to location-based services, such as Google Maps. However, clients that wish to protect their privacy may not want to reveal their current location to a location- based server (LBS). Client location can be misused by LBSs in ways that range from serious breaches of privacy, such as spying on the client’s daily commuting patterns, to simple annoyances, such as flooding the client with location-based spam and advertising in response to a query. Recent evidence suggests that privacy concerns can deter the widespread use of LBSs. For example, EZPass RFID tags have seen a low adoption rate due to privacy concerns among drivers [10, 11]. Similarly, the New York City cab drivers’ group called a strike upon the introduction of a plan to fit all cabs with GPS tracking devices [36]. Such concerns motivate the need for techniques that protect the privacy of client location from LBSs. Several prior techniques [1, 5, 8, 13, 14, 16, 17, 19, 23, 25, 34, 45] to protect the privacy of client location fall under the broad umbrella of k-anonymity [39, 42]. In a k-anonymity- based scheme, an LBS is unable to distinguish a querying client from a group of at least k clients, where k is a security parameter. For example, Spatial cloaking [19] is one such scheme that sends cloaked regions (e.g., rectangular blocks) containing at least k clients to the LBS. Unfortunately, most previously proposed k-anonymity-based schemes require a trusted third party, called an anonymizer that accepts and forwards queries from a client to an LBS and ensures the k-anonymity property. Because anonymizers must compute cloaked regions for each client and process the LBS’ query results for these regions, they can cause performance and scalability bottlenecks. In addition, most of these schemes do not work for mobile clients. Recently proposed alternatives to eliminate anonymizers include peer-to-peer techniques [5, 16, 17] and protocols based on private information retrieval (PIR) [15, 22]. Peer-to-peer techniques rely on the participation of k peers to ensure k-anonymity, in a manner akin to Crowds [37]. However, reliance on peers restricts the autonomy of such systems. Techniques based on PIR provide strong cryptographic guarantees, but are currently computationally expensive. This paper presents the design and implementation of Sybil- Query, a fully decentralized and autonomous k-anonymity- based system for location-based queries. SybilQuery is a client-side tool that operates by synthetically generating k - 1 Sybil queries for each location-based query by a client.

Transcript of Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf ·...

Page 1: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

Privately Querying Location-based Serviceswith SybilQuery

Pravin Shankar Vinod Ganapathy and Liviu IftodeDepartment of Computer Science Rutgers University

spravinvinodgiftodecsrutgersedu

ABSTRACTTo usefully query a location-based service a mobile devicemust typically present its own location in its query to theserver This may not be acceptable to clients that wish toprotect the privacy of their location This paper presentsthe design and implementation of SybilQuery a fullydecentralized and autonomous k-anonymity-based schemeto privately query location-based services SybilQuery is aclient-side tool that generates k minus 1 Sybil queries for eachquery by the client The location-based server is presentedwith a set of k queries and is unable to distinguish betweenthe clientrsquos query and the Sybil queries thereby achievingk-anonymity We tested our implementation of SybilQueryon real mobility traces of approximately 500 cabs in the SanFrancisco Bay area Our experiments show that SybilQuerycan efficiently generate Sybil queries and that these queriesare indistinguishable from real queries

Author KeywordsLocation-based Services Vehicular Computing PrivacyAnonymity

ACM Classification KeywordsK65 Management of Computing and Information SystemsSecurity and Protection

General TermsDesign Experimentation

INTRODUCTIONModern mobile devices contain global positioning systems(GPS) that let these devices precisely determine theirlocation Coupled with Internet connectivity via 3G andWiFi GPS helps transform these mobile devices to trulyubiquitous devices that can make location-based queriesA querying client device sends its location to a serverwhich responds with answers specific to the clientrsquos locationExamples of such queries include real-time traffic queriesegldquohow congested is the route from my home to myofficerdquo and points-of-interest queries egldquowhat are the

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page To copy otherwise orrepublish to post on servers or to redistribute to lists requires prior specificpermission andor a feeUbicomp 2009 Sep 30 ndash Oct 3 2009 Orlando Florida USACopyright 2009 ACM 978-1-60558-431-70909$1000

restaurants close to my current locationrdquo Several moderndevices such as the iPhone are pre-installed with queryinterfaces to location-based services such as Google Maps

However clients that wish to protect their privacy maynot want to reveal their current location to a location-based server (LBS) Client location can be misused byLBSs in ways that range from serious breaches of privacysuch as spying on the clientrsquos daily commuting patternsto simple annoyances such as flooding the client withlocation-based spam and advertising in response to a queryRecent evidence suggests that privacy concerns can deterthe widespread use of LBSs For example EZPass RFIDtags have seen a low adoption rate due to privacy concernsamong drivers [10 11] Similarly the New York City cabdriversrsquo group called a strike upon the introduction of a planto fit all cabs with GPS tracking devices [36] Such concernsmotivate the need for techniques that protect the privacy ofclient location from LBSs

Several prior techniques [1 5 8 13 14 16 17 19 23 2534 45] to protect the privacy of client location fall under thebroad umbrella of k-anonymity [39 42] In a k-anonymity-based scheme an LBS is unable to distinguish a queryingclient from a group of at least k clients where k is a securityparameter For example Spatial cloaking [19] is one suchscheme that sends cloaked regions (eg rectangular blocks)containing at least k clients to the LBS Unfortunately mostpreviously proposed k-anonymity-based schemes require atrusted third party called an anonymizer that accepts andforwards queries from a client to an LBS and ensures thek-anonymity property Because anonymizers must computecloaked regions for each client and process the LBSrsquoquery results for these regions they can cause performanceand scalability bottlenecks In addition most of theseschemes do not work for mobile clients Recently proposedalternatives to eliminate anonymizers include peer-to-peertechniques [5 16 17] and protocols based on privateinformation retrieval (PIR) [15 22] Peer-to-peer techniquesrely on the participation of k peers to ensure k-anonymityin a manner akin to Crowds [37] However reliance onpeers restricts the autonomy of such systems Techniquesbased on PIR provide strong cryptographic guarantees butare currently computationally expensive

This paper presents the design and implementation of Sybil-Query a fully decentralized and autonomous k-anonymity-based system for location-based queries SybilQuery isa client-side tool that operates by synthetically generatingkminus1 Sybil queries for each location-based query by a client

Sybil queries contain locations that resemble the clientrsquosactual location For example Sybil queries for a real queryfrom a busy downtown area will also be from areas withsimilar traffic conditions SybilQuery sends these k queriesto the LBS which is unable to identify the original queryfrom the synthetic queries thereby ensuring k-anonymityof the clientrsquos location SybilQueryrsquos design offers severaladvantages

(a) Performance SybilQuery eliminates the need foranonymizing servers and the scalability and performancebottlenecks associated with centralized designs For exam-ple choosing higher values of k for better privacy leads togreater computational overheads at the anonymizer [1 34]SybilQueryrsquos decentralized design offloads the computationsneeded to achieve k-anonymity to the clients themselvesOur experiments show that the costs of generating Sybilqueries at the client are negligible(b) Autonomy In contrast to prior work where queriesgenerated by an anonymizer (or peer-to-peer techniques)depend on other clients of the LBS SybilQuery generatesSybil queries autonomously For example cloaked regionsgenerated by spatial cloaking techniques [19] must haveat least k clients this in turn depends on the geographicdistribution of clients [26](c) Ease of deployment A decentralized and autonomousdesign lends itself to easy deployment Clients do not haveto rely on peer participation or on service providers to deployanonymizers in order to achieve k-anonymity Unlike PIR-based techniques SybilQuery does not require any changesto the LBS and only requires minor modifications to thequerying client We have integrated SybilQuery with LBSssuch as Google Maps Live Maps and Yahoo Maps

We have implemented a prototype of SybilQuery for vehic-ular networking applications Vehicles today are increas-ingly being equipped with computing device having GPScapability and internet connectivity which makes vehiclesan interesting ubiquitous computing environment OurSybilQuery system takes as input a path to be followed by avehicle along which the vehicle may issue several queries toLBSs SybilQuery outputs kminus1 Sybil paths that statisticallyresemble the input path The vehicle sends k queries whenit accesses an LBSmdashits actual location as well as k minus 1waypoints derived from the Sybil paths We evaluatedthis prototype using real mobility traces of approximately500 cabs in San Francisco [2] We also conducted a userstudy with 15 volunteers which confirmed that Sybil pathsproduced by SybilQuery closely resemble real paths

PROBLEM DEFINITION AND ASSUMPTIONSAn LBS is a database that stores a set of tuples lt ` v gtwhere `rsquos are geographic locations eg represented usinglatitudeslongitudes and each v denotes value(s) associatedwith location ` eg restaurants or current traffic conditionsat ` We consider a model in which mobile clientsperiodically query the LBS as they move from a source toa destination Each query contains the current location ofthe client and the LBS returns the set of values associatedwith that location

The problem of privately querying an LBS is for a clientto issue location-based queries without revealing its currentlocation to an adversary The SybilQuery system aims toachieve this goal by sending k minus 1 Sybil queries along witheach real query by the client These queries contain locationsalong k minus 1 Sybil paths which are synthetically generatedby SybilQuery To prevent an adversary from identifyingthe clientrsquos actual location SybilQuery must generate Sybilpaths (and thereby Sybil queries) that closely resemble realpaths

We make the following assumptions about the adversary

(a) Adversary cannot access clientrsquos identifiers A fully-constructed client query to the LBS logically consists ofthree entities (i) a network identifier such as the IP addressof the clientrsquos device that issues the location-based query(ii) an explicit user identifier such as the userrsquos registeredaccount name at the LBS or an HTTP cookie and (iii) theclientrsquos location as may be obtained by the client using GPSWe assume that the adversary does not have access to theclientrsquos network identifiers To hide network identifiers fromthe adversary we assume that the clientrsquos service provider(eg the cellular provider) and local access points whichknow the clientrsquos current location do not collude with theadversary (eg the LBS) We also assume that the LBS doesnot require the user to send explicit user identifiers in orderto use the service This is indeed a practical assumption(and is also a standard assumption in prior work) severalpopular LBSs such as Google Maps Microsoft Live Mapsand Yahoo Maps do not require a registered user or HTTPcookies for a client to usefully query them We furtherassume that the adversary does not use the query contentto infer the clientrsquos location Thus the adversary can onlyobserve the location field in each client query(b) Adversary could be active or passive A passiveadversary can observe the contents of all location-basedqueries as well as the responses received from the LBS Onthe other hand an active adversary can modify responsesfrom the LBS as a way to compromise client privacy Forexample consider an adversarial LBS that reports real-time traffic information Upon receiving a request fortraffic conditions at a particular location ` the LBS couldreturn false information eg by reporting heavy trafficat ` This response from the LBS may cause the clientto behave differently eg take a detour to avoid heavytraffic The SybilQuery system must therefore respondappropriately eg by appropriately modifying the Sybilqueries to statistically match the real query(c) Statistical background knowledge The adversarycould have access to global statistical knowledge suchas the average traffic densities at different locations atdifferent times during the day This information canpotentially be used to compromise client privacy TheSybilQuery prototype can currently defend against severalsuch statistical background knowledge attacks particularlythose that use knowledge about regional traffic Howeverbecause we cannot anticipate all such attacks we havedesigned SybilQuery to support an extensible architectureIn particular it can incorporate several statistical featuresin its algorithm to generate Sybil paths thereby making

EndpointGenerator

QueryGenerator

PathGenerator

Source amp Destination

Value of k

Original andk-1 synthetic

endpoints

Trafficstatistics

Regionalmaps

Path P andk-1 synthetic

paths

Client location l

Location l ampk-1 synthetic

locations

Response

Locationbasedservice

Figure 1 Basic design of SybilQuery

the system robust to future attacks SybilQuery howevercannot defend against adversarial LBSs that have targettedbackground information about a specific client If theLBS knows a particular clientrsquos daily commuting patternsor knows specific locations that the client visits it candifferentiate between real queries and Sybil queries sent bythat client with high probability For example if the LBSknows a clientrsquos residential address (R) and his work address(W ) it can identify the real path followed by the client bysearching for a path that starts at R and ends at W

HIDING CLIENT LOCATION USING SYBIL QUERIESThe SybilQuery system presents an interface akin to existingnavigation systems A user enters the address of the sourceand destination of a trip and the system outputs a path Pfor the user to follow A destination prediction schemesuch as the one proposed by Krumm and Horvitz [28]can be optionally used to alleviate the need for the userto enter the destination Additionally SybilQuery requiresas input a security parameter k that it uses to generatek minus 1 Sybil paths Each path is represented by the systemas a sequence of waypoints enroute from the source to thedestination As depicted in Figure 1 SybilQuery consistsof three modules an endpoint generator a path generatorand a query generator We describe each of these modules indetail below

The endpoint generator produces k minus 1 synthetic startand end points that statistically resemble the source anddestination input by the user so that the real trip doesnot stand out from the synthetic trips To do so theendpoint generator requires a database of regional trafficstatistics This database reflects historic traffic trendsat various geographic locations in the area (it is not adatabase of real-time traffic conditions) The high levelidea is that the endpoint generator processes this databaseto identify clusters of locations that share similar featuressuch as traffic density (details of specific features usedin our implementation are deferred to the next section)This process is a one-time activity and suffices to producesynthetic start and end points for multiple trips eg untilthe database is refreshed with newer statistics The endpointgenerator chooses synthetic start and end points from theclusters to which the actual source and destination belong Indoing so it also ensures that the Euclidean distance betweenthe source and destination of the actual path are within athreshold of the Euclidean distance between the syntheticstart and end points

The path generator uses the k start and end pointsincluding the actual source and destination to produce k

paths each represented as a sequence of waypoints Ingenerating paths it consults a database of regional maps (asdo existing on-board navigation systems) In fact the pathgenerator may be implemented using existing navigationsystems such as Google Maps Yahoo Maps and on-boardGPS devices because their interfaces are similar Both theendpoint generator and path generator need to be invokedonly once per trip

The query generator is triggered when the user queries theLBS with his current location ` Intuitively it simulates themotion of users along the k minus 1 Sybil paths and generatesk minus 1 Sybil locations In its simplest form the querygenerator simply computes the offset of ` from the source ofthe userrsquos path P and applies similar offsets to the sourcesof the Sybil paths to produce kminus1 Sybil locations Howeverit can also use current traffic conditions to more accuratelysimulate user movement along Sybil paths (eg simulateslower movement if traffic is congested at one of the Sybillocations)

The modular design of SybilQuery offers several advan-tages It lends itself to easy deployment even with legacydevices The endpoint generator is a standalone componentthat can possibly be installed as a user-level applicationeither on the userrsquos desktop or on his mobile deviceThe path generator can be implemented using off-the-shelfnavigation systems Using a navigation system enablesrobust path generation based on regional maps which allowsSybilQuery to generate Sybil paths that respect features suchas one-ways and road closures It also allows SybilQueryto robustly handle detours from the path P suggested bythe navigation system Upon a detour SybilQuery cansimply trigger the path generator to produce a new path tothe destination Because paths are generated independentof each other SybilQuery can also simulate detours inSybil paths SybilQueryrsquos design only requires the querygenerator to be modified to send k queries to the LBS insteadof one Based upon the deployment scenario even thismodification should be relatively easy For example if anend user employs a Web browser on his mobile phone tosend location-based queries to an LBS the query generatorcould be implemented as a browser extension

Several extensions to the basic design above can improve therobustness of the system to potential attacks

(a) Randomizing path selection Path generators im-plemented using navigation systems typically return theshortest route to the destination However real users maynot always follow the shortest path to the destination becauseof factors such as detours and road closures If SybilQueryalways produces shortest Sybil paths and the user choosesa longer path to the destination an adversary will be ableto differentiate the real path from Sybil paths with highprobabilityThis problem is addressed with path generators that cancompute multiple paths to the destination (each with varyinglengths) Instead of choosing the shortest path the pathgenerator instead uses a probability distribution (of thefrequency with which users choose paths other than the

shortest path) to make an appropriate choice from one of theavailable paths to the destination(b) Handling active adversaries An actively adversarialLBS may return doctored query responses as an attemptto differentiate Sybil paths from a clientrsquos real path Forexample suppose that an adversarial LBS falsely reportstraffic congestion at a particular location in response to a userquery In response a real user will likely take a detour If theSybilQuery system does not respond similarly to doctoredresponses (eg by checking for congestion and triggeringdetours in Sybil paths) an adversary could likely distinguishthe real path from Sybil pathsSybilQuery handles active adversaries usingN -variant queriesIn this technique SybilQuery queries multiple LBSs andcompares their responses in a manner akin to N -variantsystems [6] Unless all the LBSs queried collude with eachother adversarial LBSs are likely to be detected(c) Endpoint caching Suppose that a real path P fre-quented by the user (eg commuter paths such as hometo office) is associated with multiple sets of Sybil pathsAn LBS that observes paths over a period of time canstatistically identify P as the real path Similarly when auser travels to a new location shortly after arriving at thefirst destination if the second set of Sybil trips are randomlygenerated then the real paths can be easily distinguishedfrom the Sybil paths since the real paths share an endpointSybilQuery handles the above attacks by performing threetypes of caching (1) for the most common trips by the userthe Sybil endpoints are cached (2) if the user makes multipletrips from one common endpoint (eg his home or office)the corresponding Sybil endpoints are cached and (3) whenthe user embarks on a multi-destination trip the start pointsof the Sybil trips are cached so that the endpoint of a trip isthe same as the startpoint of the following trip(d) Providing path continuity Although the paths gen-erated by SybilQuery statistically resemble each other itis possible for the trip durations to differ (for both realand Sybil trips) If a real trip ends before some of theSybil trips end and the system stops sending queriesthe LBS can differentiate the real path from Sybil pathsSybilQuery guards against this by being an ldquoalways onrdquo toolthat continues to simulate movement along Sybil paths evenwhen the userrsquos real trip is complete This feature ensuressecurity even when the lengthsdurations of Sybil trips donot match those of real trips(e) Adding GPS sensor noise The location returned byGPS devices is typically inaccurate with the errors follow-ing a Gaussian distribution [7] If the Sybil locations alwaysfall on a road segment an adversary can differentiate themfrom locations returned by real GPS devices SybilQuerycan prevent this by adding a random noise to each Sybillocation

THE SybilQuery PROTOTYPEIn this section we describe the implementation of ourSybilQuery prototype The bulk of this section is devotedto the description of the endpoint generator which is imple-mented as a Python client that uses a PostgreSQL databasebackend with PostGIS spatial extensions to store regionaltraffic information The path generator is implemented asa client that queries an off-the-shelf service [35] to find the

set of waypoints corresponding to the shortest path givena startpoint and an endpoint Finally the query generatorsimulates user movement along all the k paths

The Endpoint GeneratorAs discussed in the previous section the endpoint generatoruses the source and destination addresses input by the userto generate k minus 1 synthetic endpoints To ensure thatthese synthetic endpoints statistically resemble the actualendpoints of the userrsquos trip the endpoint generator usesa regional traffic database The endpoint generator firstpreprocesses the database to identify key features of eachgeographic location

Regional Traffic Database For our prototype we useda month-long (August 25 2008ndashSeptember 24 2008) setof GPS traces from the Cabspotting project [2] as regionaltraffic database The Cabspotting project tracks the mobilityof cabs in the San Francisco Bay area1 Each cab thatparticipated in this project was outfit with a GPS device thatupdated its location with a server each minute Each of theseupdates was a quadruplelttimestamp cab ID current location flaggtHere flag indicates whether the cab was metered (ie ina trip) or empty We used flag to convert the GPS tracesinto trips (ie to demarcate sources and destinations duringwhich the cab was metered) Overall our database containsa total of 529533 trips by 530 unique cabs we observed atleast 447 cabs on any given day Although we tailor ourdiscussion in the rest of this section to this database weemphasize that the techniques employed by SybilQuery areapplicable to any traffic database

Preprocessing the Traffic Database An ideal implemen-tation of the endpoint generator would use an annotateddatabase of the local region to identify synthetic endpointsthat resemble the endpoints input by the user Theannotated database would contain descriptive tags for eachgeographic location such as ldquoparking lotrdquo ldquodowntownoffice buildingrdquo or ldquofreewayrdquo The endpoint generatorwould output synthetic sources and destinations whose tagsmatch the corresponding tags of the source and destinationsupplied by the user However such annotated databasesare laborious to create and even so are unlikely to becomprehensive or contain tags suitable for LBSs in differentapplication domains

The preprocessing step addresses this problem by automati-cally computing a feature set that describes each geographiclocation using information from the traffic database Theseautomatically extracted features then serve as tags that theendpoint generator can use to find synthetic sources anddestinations that statistically resemble the userrsquos input

Our implementation uses traffic density as the feature thatcharacterizes each geographic location We define the trafficdensity τ` of a geographic location ` as the number of tripsthat start end or traverse through ` in a fixed interval of time1Consequently our experiments also focus on trips in thisgeographic region However SybilQuery can automaticallygenerate Sybil queries for any geographic region if provided with asimilar traffic database for that region

(we describe in detail below our how our system representsgeographic locations for simplicity it suffices to think ofthem as fixed-size rectangular regions)We calculated τ`for each location ` using the PostGIS spatial extension ofPostgreSQL Because τ` can acquire a large number ofdiscrete values we used a simple clustering algorithm togroup ldquosimilarrdquo values of τ` into a single cluster Althoughany clustering technique may be used we found that evensimple clustering algorithms such as separating values of τ`into buckets serve our purpose well We empirically verifiedthat these clusters of the San Francisco Bay area sharesimilar semantic properties For example different shoppingcenters were grouped together into the same cluster as wereresidential areas

In addition to computing the traffic density of each geo-graphic location ` the preprocessing step also computes aprobability distribution function (PDF) π` of the length ofa trip that originates at ` In particular it computes thefollowing quantity for each location ` for all values of len

π`(len) = trips of length len that start in `

trips that start in `

This PDF is used by the endpoint generation algorithm tochoose appropriate geographic locations as sources for Sybilpaths

Temporal traffic patterns Traffic densities at a geographiclocation ` vary depending on the hour of the day and the dayof the week which results in temporal patterns in the valuesof τ` To reflect these temporal patterns in the features ofeach geographic location we define six kinds of temporalstates based on the patterns in our dataset namely ldquopeakintervalweekdayrdquo (7pm-11pm) ldquonormal intervalweekdayrdquo(6am-7pm and 11pm-3am) ldquooff-peak intervalweekdayrdquo(3am-6am) ldquopeak intervalweekendrdquo (7am-11pm) ldquonormalintervalweekendrdquo (6am-7pm and 11pm-3am) and ldquooff-peakintervalweekendrdquo (3am-6am) The precomputation stepcomputes six values of τ` for each location correspondingto each of the temporal states above When the user suppliesa source and a destination the endpoint generator choosesthe value of τ` to compute synthetic endpoints based uponthe timestamp in the userrsquos input

Representing geographic locations To compute τ` thepreprocessing algorithm needs an appropriate data structureto represent the location ` This data structure mustsimultaneously balance precision and scalability ie itmust store precise values of τ` for each location ` andmust readily scale to large geographic regions such as theSan Francisco Bay area To better illustrate this tradeoffsuppose that geographic locations are represented usingfixed-size rectangular blocks as was the case in an earlyimplementation of our prototype We found that a block sizeof 400mtimes400m resulted in a manageable number of blocksfor the entire Bay area (about 100000 blocks) but did notaccurately represent traffic densities in crowded areas suchas the downtown region and the airport On the other handa block size of 25mtimes25m accurately represented trafficdensities but resulted in a large number of blocks (about64 million blocks)

(a) San Francisco Bay areaThe airport appears on thelower right corner of thisfigure

(b) San Francisco airportBlack blocks have high traf-fic densities

Figure 2 Quadtree rep-resentations of geographiclocations

SybilQuery therefore usesan adaptive data structurethe Quadtree [12] to repre-sent traffic densities Thisdata structure represents ge-ographic locations using blocksof varying sizes (smallestsize is 25mtimes25m) depend-ing on the traffic density ofthat location In particu-lar geographic regions withnon-uniform distribution oftraffic are represented usingsmall block sizes while re-gions with a uniform dis-tribution of traffic are rep-resented using larger blocksizes Using the Quadtreedata structure we were ableto represent traffic densitiesfor the entire San FranciscoBay area using just 16000blocks

Output of preprocessingFigure 2 pictorially repre-sents the output of the pre-processing step Figure 2(a) shows the quadtree represen-tation of the entire San Francisco Bay area The adaptivenature of the Quadtree data structure results in differentblock sizes for different geographic locations Althougheach of these blocks has uniform traffic density in ourexperience areas with high traffic density tend to berepresented with smaller block sizes This effect is mostpronounced in areas with dense traffic such as the airportdepicted in Figure 2(b) Note that the regions with highesttraffic density such as the freeway and the pickup area infront of the airport are represented using small blocks whileareas inside the airport that have no traffic are representedusing larger block sizes Not all blocks have the sametraffic density in Figure 2(b) blocks with the highest trafficdensities are filled in black

Generating Sybil Endpoints When provided with a sourceand a destination input by the user the endpoint generatorcomputes a Sybil source destination pair as shown inAlgorithm 1 (our prototype adapts the same algorithm togenerate k minus 1 Sybil source destination pairs for brevitywe only illustrate the algorithm for one pair) It first finds thegeographic locations (in the Quadtree representation) `srcand `dst that contain the source and destination addressesinput by the user Next it computes the set of all sourcelocations ` that satisfy two conditions (1) the traffic densityof ` is approximately that of `src and (2) the probability ofa trip of length dist originating from ` matches that of `srcThe key intuition behind this step is to find the set of sourcelocations that closely resemble the source location input bythe user It randomly chooses a source location `srcprime fromthis set of locations and finds a destination location `dstprime

within a radius dist of `srcprime whose traffic density matchesthat of `dst The algorithm generates end points that are ofsimilar lengths but not exactly the same length as the real

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 2: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

Sybil queries contain locations that resemble the clientrsquosactual location For example Sybil queries for a real queryfrom a busy downtown area will also be from areas withsimilar traffic conditions SybilQuery sends these k queriesto the LBS which is unable to identify the original queryfrom the synthetic queries thereby ensuring k-anonymityof the clientrsquos location SybilQueryrsquos design offers severaladvantages

(a) Performance SybilQuery eliminates the need foranonymizing servers and the scalability and performancebottlenecks associated with centralized designs For exam-ple choosing higher values of k for better privacy leads togreater computational overheads at the anonymizer [1 34]SybilQueryrsquos decentralized design offloads the computationsneeded to achieve k-anonymity to the clients themselvesOur experiments show that the costs of generating Sybilqueries at the client are negligible(b) Autonomy In contrast to prior work where queriesgenerated by an anonymizer (or peer-to-peer techniques)depend on other clients of the LBS SybilQuery generatesSybil queries autonomously For example cloaked regionsgenerated by spatial cloaking techniques [19] must haveat least k clients this in turn depends on the geographicdistribution of clients [26](c) Ease of deployment A decentralized and autonomousdesign lends itself to easy deployment Clients do not haveto rely on peer participation or on service providers to deployanonymizers in order to achieve k-anonymity Unlike PIR-based techniques SybilQuery does not require any changesto the LBS and only requires minor modifications to thequerying client We have integrated SybilQuery with LBSssuch as Google Maps Live Maps and Yahoo Maps

We have implemented a prototype of SybilQuery for vehic-ular networking applications Vehicles today are increas-ingly being equipped with computing device having GPScapability and internet connectivity which makes vehiclesan interesting ubiquitous computing environment OurSybilQuery system takes as input a path to be followed by avehicle along which the vehicle may issue several queries toLBSs SybilQuery outputs kminus1 Sybil paths that statisticallyresemble the input path The vehicle sends k queries whenit accesses an LBSmdashits actual location as well as k minus 1waypoints derived from the Sybil paths We evaluatedthis prototype using real mobility traces of approximately500 cabs in San Francisco [2] We also conducted a userstudy with 15 volunteers which confirmed that Sybil pathsproduced by SybilQuery closely resemble real paths

PROBLEM DEFINITION AND ASSUMPTIONSAn LBS is a database that stores a set of tuples lt ` v gtwhere `rsquos are geographic locations eg represented usinglatitudeslongitudes and each v denotes value(s) associatedwith location ` eg restaurants or current traffic conditionsat ` We consider a model in which mobile clientsperiodically query the LBS as they move from a source toa destination Each query contains the current location ofthe client and the LBS returns the set of values associatedwith that location

The problem of privately querying an LBS is for a clientto issue location-based queries without revealing its currentlocation to an adversary The SybilQuery system aims toachieve this goal by sending k minus 1 Sybil queries along witheach real query by the client These queries contain locationsalong k minus 1 Sybil paths which are synthetically generatedby SybilQuery To prevent an adversary from identifyingthe clientrsquos actual location SybilQuery must generate Sybilpaths (and thereby Sybil queries) that closely resemble realpaths

We make the following assumptions about the adversary

(a) Adversary cannot access clientrsquos identifiers A fully-constructed client query to the LBS logically consists ofthree entities (i) a network identifier such as the IP addressof the clientrsquos device that issues the location-based query(ii) an explicit user identifier such as the userrsquos registeredaccount name at the LBS or an HTTP cookie and (iii) theclientrsquos location as may be obtained by the client using GPSWe assume that the adversary does not have access to theclientrsquos network identifiers To hide network identifiers fromthe adversary we assume that the clientrsquos service provider(eg the cellular provider) and local access points whichknow the clientrsquos current location do not collude with theadversary (eg the LBS) We also assume that the LBS doesnot require the user to send explicit user identifiers in orderto use the service This is indeed a practical assumption(and is also a standard assumption in prior work) severalpopular LBSs such as Google Maps Microsoft Live Mapsand Yahoo Maps do not require a registered user or HTTPcookies for a client to usefully query them We furtherassume that the adversary does not use the query contentto infer the clientrsquos location Thus the adversary can onlyobserve the location field in each client query(b) Adversary could be active or passive A passiveadversary can observe the contents of all location-basedqueries as well as the responses received from the LBS Onthe other hand an active adversary can modify responsesfrom the LBS as a way to compromise client privacy Forexample consider an adversarial LBS that reports real-time traffic information Upon receiving a request fortraffic conditions at a particular location ` the LBS couldreturn false information eg by reporting heavy trafficat ` This response from the LBS may cause the clientto behave differently eg take a detour to avoid heavytraffic The SybilQuery system must therefore respondappropriately eg by appropriately modifying the Sybilqueries to statistically match the real query(c) Statistical background knowledge The adversarycould have access to global statistical knowledge suchas the average traffic densities at different locations atdifferent times during the day This information canpotentially be used to compromise client privacy TheSybilQuery prototype can currently defend against severalsuch statistical background knowledge attacks particularlythose that use knowledge about regional traffic Howeverbecause we cannot anticipate all such attacks we havedesigned SybilQuery to support an extensible architectureIn particular it can incorporate several statistical featuresin its algorithm to generate Sybil paths thereby making

EndpointGenerator

QueryGenerator

PathGenerator

Source amp Destination

Value of k

Original andk-1 synthetic

endpoints

Trafficstatistics

Regionalmaps

Path P andk-1 synthetic

paths

Client location l

Location l ampk-1 synthetic

locations

Response

Locationbasedservice

Figure 1 Basic design of SybilQuery

the system robust to future attacks SybilQuery howevercannot defend against adversarial LBSs that have targettedbackground information about a specific client If theLBS knows a particular clientrsquos daily commuting patternsor knows specific locations that the client visits it candifferentiate between real queries and Sybil queries sent bythat client with high probability For example if the LBSknows a clientrsquos residential address (R) and his work address(W ) it can identify the real path followed by the client bysearching for a path that starts at R and ends at W

HIDING CLIENT LOCATION USING SYBIL QUERIESThe SybilQuery system presents an interface akin to existingnavigation systems A user enters the address of the sourceand destination of a trip and the system outputs a path Pfor the user to follow A destination prediction schemesuch as the one proposed by Krumm and Horvitz [28]can be optionally used to alleviate the need for the userto enter the destination Additionally SybilQuery requiresas input a security parameter k that it uses to generatek minus 1 Sybil paths Each path is represented by the systemas a sequence of waypoints enroute from the source to thedestination As depicted in Figure 1 SybilQuery consistsof three modules an endpoint generator a path generatorand a query generator We describe each of these modules indetail below

The endpoint generator produces k minus 1 synthetic startand end points that statistically resemble the source anddestination input by the user so that the real trip doesnot stand out from the synthetic trips To do so theendpoint generator requires a database of regional trafficstatistics This database reflects historic traffic trendsat various geographic locations in the area (it is not adatabase of real-time traffic conditions) The high levelidea is that the endpoint generator processes this databaseto identify clusters of locations that share similar featuressuch as traffic density (details of specific features usedin our implementation are deferred to the next section)This process is a one-time activity and suffices to producesynthetic start and end points for multiple trips eg untilthe database is refreshed with newer statistics The endpointgenerator chooses synthetic start and end points from theclusters to which the actual source and destination belong Indoing so it also ensures that the Euclidean distance betweenthe source and destination of the actual path are within athreshold of the Euclidean distance between the syntheticstart and end points

The path generator uses the k start and end pointsincluding the actual source and destination to produce k

paths each represented as a sequence of waypoints Ingenerating paths it consults a database of regional maps (asdo existing on-board navigation systems) In fact the pathgenerator may be implemented using existing navigationsystems such as Google Maps Yahoo Maps and on-boardGPS devices because their interfaces are similar Both theendpoint generator and path generator need to be invokedonly once per trip

The query generator is triggered when the user queries theLBS with his current location ` Intuitively it simulates themotion of users along the k minus 1 Sybil paths and generatesk minus 1 Sybil locations In its simplest form the querygenerator simply computes the offset of ` from the source ofthe userrsquos path P and applies similar offsets to the sourcesof the Sybil paths to produce kminus1 Sybil locations Howeverit can also use current traffic conditions to more accuratelysimulate user movement along Sybil paths (eg simulateslower movement if traffic is congested at one of the Sybillocations)

The modular design of SybilQuery offers several advan-tages It lends itself to easy deployment even with legacydevices The endpoint generator is a standalone componentthat can possibly be installed as a user-level applicationeither on the userrsquos desktop or on his mobile deviceThe path generator can be implemented using off-the-shelfnavigation systems Using a navigation system enablesrobust path generation based on regional maps which allowsSybilQuery to generate Sybil paths that respect features suchas one-ways and road closures It also allows SybilQueryto robustly handle detours from the path P suggested bythe navigation system Upon a detour SybilQuery cansimply trigger the path generator to produce a new path tothe destination Because paths are generated independentof each other SybilQuery can also simulate detours inSybil paths SybilQueryrsquos design only requires the querygenerator to be modified to send k queries to the LBS insteadof one Based upon the deployment scenario even thismodification should be relatively easy For example if anend user employs a Web browser on his mobile phone tosend location-based queries to an LBS the query generatorcould be implemented as a browser extension

Several extensions to the basic design above can improve therobustness of the system to potential attacks

(a) Randomizing path selection Path generators im-plemented using navigation systems typically return theshortest route to the destination However real users maynot always follow the shortest path to the destination becauseof factors such as detours and road closures If SybilQueryalways produces shortest Sybil paths and the user choosesa longer path to the destination an adversary will be ableto differentiate the real path from Sybil paths with highprobabilityThis problem is addressed with path generators that cancompute multiple paths to the destination (each with varyinglengths) Instead of choosing the shortest path the pathgenerator instead uses a probability distribution (of thefrequency with which users choose paths other than the

shortest path) to make an appropriate choice from one of theavailable paths to the destination(b) Handling active adversaries An actively adversarialLBS may return doctored query responses as an attemptto differentiate Sybil paths from a clientrsquos real path Forexample suppose that an adversarial LBS falsely reportstraffic congestion at a particular location in response to a userquery In response a real user will likely take a detour If theSybilQuery system does not respond similarly to doctoredresponses (eg by checking for congestion and triggeringdetours in Sybil paths) an adversary could likely distinguishthe real path from Sybil pathsSybilQuery handles active adversaries usingN -variant queriesIn this technique SybilQuery queries multiple LBSs andcompares their responses in a manner akin to N -variantsystems [6] Unless all the LBSs queried collude with eachother adversarial LBSs are likely to be detected(c) Endpoint caching Suppose that a real path P fre-quented by the user (eg commuter paths such as hometo office) is associated with multiple sets of Sybil pathsAn LBS that observes paths over a period of time canstatistically identify P as the real path Similarly when auser travels to a new location shortly after arriving at thefirst destination if the second set of Sybil trips are randomlygenerated then the real paths can be easily distinguishedfrom the Sybil paths since the real paths share an endpointSybilQuery handles the above attacks by performing threetypes of caching (1) for the most common trips by the userthe Sybil endpoints are cached (2) if the user makes multipletrips from one common endpoint (eg his home or office)the corresponding Sybil endpoints are cached and (3) whenthe user embarks on a multi-destination trip the start pointsof the Sybil trips are cached so that the endpoint of a trip isthe same as the startpoint of the following trip(d) Providing path continuity Although the paths gen-erated by SybilQuery statistically resemble each other itis possible for the trip durations to differ (for both realand Sybil trips) If a real trip ends before some of theSybil trips end and the system stops sending queriesthe LBS can differentiate the real path from Sybil pathsSybilQuery guards against this by being an ldquoalways onrdquo toolthat continues to simulate movement along Sybil paths evenwhen the userrsquos real trip is complete This feature ensuressecurity even when the lengthsdurations of Sybil trips donot match those of real trips(e) Adding GPS sensor noise The location returned byGPS devices is typically inaccurate with the errors follow-ing a Gaussian distribution [7] If the Sybil locations alwaysfall on a road segment an adversary can differentiate themfrom locations returned by real GPS devices SybilQuerycan prevent this by adding a random noise to each Sybillocation

THE SybilQuery PROTOTYPEIn this section we describe the implementation of ourSybilQuery prototype The bulk of this section is devotedto the description of the endpoint generator which is imple-mented as a Python client that uses a PostgreSQL databasebackend with PostGIS spatial extensions to store regionaltraffic information The path generator is implemented asa client that queries an off-the-shelf service [35] to find the

set of waypoints corresponding to the shortest path givena startpoint and an endpoint Finally the query generatorsimulates user movement along all the k paths

The Endpoint GeneratorAs discussed in the previous section the endpoint generatoruses the source and destination addresses input by the userto generate k minus 1 synthetic endpoints To ensure thatthese synthetic endpoints statistically resemble the actualendpoints of the userrsquos trip the endpoint generator usesa regional traffic database The endpoint generator firstpreprocesses the database to identify key features of eachgeographic location

Regional Traffic Database For our prototype we useda month-long (August 25 2008ndashSeptember 24 2008) setof GPS traces from the Cabspotting project [2] as regionaltraffic database The Cabspotting project tracks the mobilityof cabs in the San Francisco Bay area1 Each cab thatparticipated in this project was outfit with a GPS device thatupdated its location with a server each minute Each of theseupdates was a quadruplelttimestamp cab ID current location flaggtHere flag indicates whether the cab was metered (ie ina trip) or empty We used flag to convert the GPS tracesinto trips (ie to demarcate sources and destinations duringwhich the cab was metered) Overall our database containsa total of 529533 trips by 530 unique cabs we observed atleast 447 cabs on any given day Although we tailor ourdiscussion in the rest of this section to this database weemphasize that the techniques employed by SybilQuery areapplicable to any traffic database

Preprocessing the Traffic Database An ideal implemen-tation of the endpoint generator would use an annotateddatabase of the local region to identify synthetic endpointsthat resemble the endpoints input by the user Theannotated database would contain descriptive tags for eachgeographic location such as ldquoparking lotrdquo ldquodowntownoffice buildingrdquo or ldquofreewayrdquo The endpoint generatorwould output synthetic sources and destinations whose tagsmatch the corresponding tags of the source and destinationsupplied by the user However such annotated databasesare laborious to create and even so are unlikely to becomprehensive or contain tags suitable for LBSs in differentapplication domains

The preprocessing step addresses this problem by automati-cally computing a feature set that describes each geographiclocation using information from the traffic database Theseautomatically extracted features then serve as tags that theendpoint generator can use to find synthetic sources anddestinations that statistically resemble the userrsquos input

Our implementation uses traffic density as the feature thatcharacterizes each geographic location We define the trafficdensity τ` of a geographic location ` as the number of tripsthat start end or traverse through ` in a fixed interval of time1Consequently our experiments also focus on trips in thisgeographic region However SybilQuery can automaticallygenerate Sybil queries for any geographic region if provided with asimilar traffic database for that region

(we describe in detail below our how our system representsgeographic locations for simplicity it suffices to think ofthem as fixed-size rectangular regions)We calculated τ`for each location ` using the PostGIS spatial extension ofPostgreSQL Because τ` can acquire a large number ofdiscrete values we used a simple clustering algorithm togroup ldquosimilarrdquo values of τ` into a single cluster Althoughany clustering technique may be used we found that evensimple clustering algorithms such as separating values of τ`into buckets serve our purpose well We empirically verifiedthat these clusters of the San Francisco Bay area sharesimilar semantic properties For example different shoppingcenters were grouped together into the same cluster as wereresidential areas

In addition to computing the traffic density of each geo-graphic location ` the preprocessing step also computes aprobability distribution function (PDF) π` of the length ofa trip that originates at ` In particular it computes thefollowing quantity for each location ` for all values of len

π`(len) = trips of length len that start in `

trips that start in `

This PDF is used by the endpoint generation algorithm tochoose appropriate geographic locations as sources for Sybilpaths

Temporal traffic patterns Traffic densities at a geographiclocation ` vary depending on the hour of the day and the dayof the week which results in temporal patterns in the valuesof τ` To reflect these temporal patterns in the features ofeach geographic location we define six kinds of temporalstates based on the patterns in our dataset namely ldquopeakintervalweekdayrdquo (7pm-11pm) ldquonormal intervalweekdayrdquo(6am-7pm and 11pm-3am) ldquooff-peak intervalweekdayrdquo(3am-6am) ldquopeak intervalweekendrdquo (7am-11pm) ldquonormalintervalweekendrdquo (6am-7pm and 11pm-3am) and ldquooff-peakintervalweekendrdquo (3am-6am) The precomputation stepcomputes six values of τ` for each location correspondingto each of the temporal states above When the user suppliesa source and a destination the endpoint generator choosesthe value of τ` to compute synthetic endpoints based uponthe timestamp in the userrsquos input

Representing geographic locations To compute τ` thepreprocessing algorithm needs an appropriate data structureto represent the location ` This data structure mustsimultaneously balance precision and scalability ie itmust store precise values of τ` for each location ` andmust readily scale to large geographic regions such as theSan Francisco Bay area To better illustrate this tradeoffsuppose that geographic locations are represented usingfixed-size rectangular blocks as was the case in an earlyimplementation of our prototype We found that a block sizeof 400mtimes400m resulted in a manageable number of blocksfor the entire Bay area (about 100000 blocks) but did notaccurately represent traffic densities in crowded areas suchas the downtown region and the airport On the other handa block size of 25mtimes25m accurately represented trafficdensities but resulted in a large number of blocks (about64 million blocks)

(a) San Francisco Bay areaThe airport appears on thelower right corner of thisfigure

(b) San Francisco airportBlack blocks have high traf-fic densities

Figure 2 Quadtree rep-resentations of geographiclocations

SybilQuery therefore usesan adaptive data structurethe Quadtree [12] to repre-sent traffic densities Thisdata structure represents ge-ographic locations using blocksof varying sizes (smallestsize is 25mtimes25m) depend-ing on the traffic density ofthat location In particu-lar geographic regions withnon-uniform distribution oftraffic are represented usingsmall block sizes while re-gions with a uniform dis-tribution of traffic are rep-resented using larger blocksizes Using the Quadtreedata structure we were ableto represent traffic densitiesfor the entire San FranciscoBay area using just 16000blocks

Output of preprocessingFigure 2 pictorially repre-sents the output of the pre-processing step Figure 2(a) shows the quadtree represen-tation of the entire San Francisco Bay area The adaptivenature of the Quadtree data structure results in differentblock sizes for different geographic locations Althougheach of these blocks has uniform traffic density in ourexperience areas with high traffic density tend to berepresented with smaller block sizes This effect is mostpronounced in areas with dense traffic such as the airportdepicted in Figure 2(b) Note that the regions with highesttraffic density such as the freeway and the pickup area infront of the airport are represented using small blocks whileareas inside the airport that have no traffic are representedusing larger block sizes Not all blocks have the sametraffic density in Figure 2(b) blocks with the highest trafficdensities are filled in black

Generating Sybil Endpoints When provided with a sourceand a destination input by the user the endpoint generatorcomputes a Sybil source destination pair as shown inAlgorithm 1 (our prototype adapts the same algorithm togenerate k minus 1 Sybil source destination pairs for brevitywe only illustrate the algorithm for one pair) It first finds thegeographic locations (in the Quadtree representation) `srcand `dst that contain the source and destination addressesinput by the user Next it computes the set of all sourcelocations ` that satisfy two conditions (1) the traffic densityof ` is approximately that of `src and (2) the probability ofa trip of length dist originating from ` matches that of `srcThe key intuition behind this step is to find the set of sourcelocations that closely resemble the source location input bythe user It randomly chooses a source location `srcprime fromthis set of locations and finds a destination location `dstprime

within a radius dist of `srcprime whose traffic density matchesthat of `dst The algorithm generates end points that are ofsimilar lengths but not exactly the same length as the real

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 3: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

EndpointGenerator

QueryGenerator

PathGenerator

Source amp Destination

Value of k

Original andk-1 synthetic

endpoints

Trafficstatistics

Regionalmaps

Path P andk-1 synthetic

paths

Client location l

Location l ampk-1 synthetic

locations

Response

Locationbasedservice

Figure 1 Basic design of SybilQuery

the system robust to future attacks SybilQuery howevercannot defend against adversarial LBSs that have targettedbackground information about a specific client If theLBS knows a particular clientrsquos daily commuting patternsor knows specific locations that the client visits it candifferentiate between real queries and Sybil queries sent bythat client with high probability For example if the LBSknows a clientrsquos residential address (R) and his work address(W ) it can identify the real path followed by the client bysearching for a path that starts at R and ends at W

HIDING CLIENT LOCATION USING SYBIL QUERIESThe SybilQuery system presents an interface akin to existingnavigation systems A user enters the address of the sourceand destination of a trip and the system outputs a path Pfor the user to follow A destination prediction schemesuch as the one proposed by Krumm and Horvitz [28]can be optionally used to alleviate the need for the userto enter the destination Additionally SybilQuery requiresas input a security parameter k that it uses to generatek minus 1 Sybil paths Each path is represented by the systemas a sequence of waypoints enroute from the source to thedestination As depicted in Figure 1 SybilQuery consistsof three modules an endpoint generator a path generatorand a query generator We describe each of these modules indetail below

The endpoint generator produces k minus 1 synthetic startand end points that statistically resemble the source anddestination input by the user so that the real trip doesnot stand out from the synthetic trips To do so theendpoint generator requires a database of regional trafficstatistics This database reflects historic traffic trendsat various geographic locations in the area (it is not adatabase of real-time traffic conditions) The high levelidea is that the endpoint generator processes this databaseto identify clusters of locations that share similar featuressuch as traffic density (details of specific features usedin our implementation are deferred to the next section)This process is a one-time activity and suffices to producesynthetic start and end points for multiple trips eg untilthe database is refreshed with newer statistics The endpointgenerator chooses synthetic start and end points from theclusters to which the actual source and destination belong Indoing so it also ensures that the Euclidean distance betweenthe source and destination of the actual path are within athreshold of the Euclidean distance between the syntheticstart and end points

The path generator uses the k start and end pointsincluding the actual source and destination to produce k

paths each represented as a sequence of waypoints Ingenerating paths it consults a database of regional maps (asdo existing on-board navigation systems) In fact the pathgenerator may be implemented using existing navigationsystems such as Google Maps Yahoo Maps and on-boardGPS devices because their interfaces are similar Both theendpoint generator and path generator need to be invokedonly once per trip

The query generator is triggered when the user queries theLBS with his current location ` Intuitively it simulates themotion of users along the k minus 1 Sybil paths and generatesk minus 1 Sybil locations In its simplest form the querygenerator simply computes the offset of ` from the source ofthe userrsquos path P and applies similar offsets to the sourcesof the Sybil paths to produce kminus1 Sybil locations Howeverit can also use current traffic conditions to more accuratelysimulate user movement along Sybil paths (eg simulateslower movement if traffic is congested at one of the Sybillocations)

The modular design of SybilQuery offers several advan-tages It lends itself to easy deployment even with legacydevices The endpoint generator is a standalone componentthat can possibly be installed as a user-level applicationeither on the userrsquos desktop or on his mobile deviceThe path generator can be implemented using off-the-shelfnavigation systems Using a navigation system enablesrobust path generation based on regional maps which allowsSybilQuery to generate Sybil paths that respect features suchas one-ways and road closures It also allows SybilQueryto robustly handle detours from the path P suggested bythe navigation system Upon a detour SybilQuery cansimply trigger the path generator to produce a new path tothe destination Because paths are generated independentof each other SybilQuery can also simulate detours inSybil paths SybilQueryrsquos design only requires the querygenerator to be modified to send k queries to the LBS insteadof one Based upon the deployment scenario even thismodification should be relatively easy For example if anend user employs a Web browser on his mobile phone tosend location-based queries to an LBS the query generatorcould be implemented as a browser extension

Several extensions to the basic design above can improve therobustness of the system to potential attacks

(a) Randomizing path selection Path generators im-plemented using navigation systems typically return theshortest route to the destination However real users maynot always follow the shortest path to the destination becauseof factors such as detours and road closures If SybilQueryalways produces shortest Sybil paths and the user choosesa longer path to the destination an adversary will be ableto differentiate the real path from Sybil paths with highprobabilityThis problem is addressed with path generators that cancompute multiple paths to the destination (each with varyinglengths) Instead of choosing the shortest path the pathgenerator instead uses a probability distribution (of thefrequency with which users choose paths other than the

shortest path) to make an appropriate choice from one of theavailable paths to the destination(b) Handling active adversaries An actively adversarialLBS may return doctored query responses as an attemptto differentiate Sybil paths from a clientrsquos real path Forexample suppose that an adversarial LBS falsely reportstraffic congestion at a particular location in response to a userquery In response a real user will likely take a detour If theSybilQuery system does not respond similarly to doctoredresponses (eg by checking for congestion and triggeringdetours in Sybil paths) an adversary could likely distinguishthe real path from Sybil pathsSybilQuery handles active adversaries usingN -variant queriesIn this technique SybilQuery queries multiple LBSs andcompares their responses in a manner akin to N -variantsystems [6] Unless all the LBSs queried collude with eachother adversarial LBSs are likely to be detected(c) Endpoint caching Suppose that a real path P fre-quented by the user (eg commuter paths such as hometo office) is associated with multiple sets of Sybil pathsAn LBS that observes paths over a period of time canstatistically identify P as the real path Similarly when auser travels to a new location shortly after arriving at thefirst destination if the second set of Sybil trips are randomlygenerated then the real paths can be easily distinguishedfrom the Sybil paths since the real paths share an endpointSybilQuery handles the above attacks by performing threetypes of caching (1) for the most common trips by the userthe Sybil endpoints are cached (2) if the user makes multipletrips from one common endpoint (eg his home or office)the corresponding Sybil endpoints are cached and (3) whenthe user embarks on a multi-destination trip the start pointsof the Sybil trips are cached so that the endpoint of a trip isthe same as the startpoint of the following trip(d) Providing path continuity Although the paths gen-erated by SybilQuery statistically resemble each other itis possible for the trip durations to differ (for both realand Sybil trips) If a real trip ends before some of theSybil trips end and the system stops sending queriesthe LBS can differentiate the real path from Sybil pathsSybilQuery guards against this by being an ldquoalways onrdquo toolthat continues to simulate movement along Sybil paths evenwhen the userrsquos real trip is complete This feature ensuressecurity even when the lengthsdurations of Sybil trips donot match those of real trips(e) Adding GPS sensor noise The location returned byGPS devices is typically inaccurate with the errors follow-ing a Gaussian distribution [7] If the Sybil locations alwaysfall on a road segment an adversary can differentiate themfrom locations returned by real GPS devices SybilQuerycan prevent this by adding a random noise to each Sybillocation

THE SybilQuery PROTOTYPEIn this section we describe the implementation of ourSybilQuery prototype The bulk of this section is devotedto the description of the endpoint generator which is imple-mented as a Python client that uses a PostgreSQL databasebackend with PostGIS spatial extensions to store regionaltraffic information The path generator is implemented asa client that queries an off-the-shelf service [35] to find the

set of waypoints corresponding to the shortest path givena startpoint and an endpoint Finally the query generatorsimulates user movement along all the k paths

The Endpoint GeneratorAs discussed in the previous section the endpoint generatoruses the source and destination addresses input by the userto generate k minus 1 synthetic endpoints To ensure thatthese synthetic endpoints statistically resemble the actualendpoints of the userrsquos trip the endpoint generator usesa regional traffic database The endpoint generator firstpreprocesses the database to identify key features of eachgeographic location

Regional Traffic Database For our prototype we useda month-long (August 25 2008ndashSeptember 24 2008) setof GPS traces from the Cabspotting project [2] as regionaltraffic database The Cabspotting project tracks the mobilityof cabs in the San Francisco Bay area1 Each cab thatparticipated in this project was outfit with a GPS device thatupdated its location with a server each minute Each of theseupdates was a quadruplelttimestamp cab ID current location flaggtHere flag indicates whether the cab was metered (ie ina trip) or empty We used flag to convert the GPS tracesinto trips (ie to demarcate sources and destinations duringwhich the cab was metered) Overall our database containsa total of 529533 trips by 530 unique cabs we observed atleast 447 cabs on any given day Although we tailor ourdiscussion in the rest of this section to this database weemphasize that the techniques employed by SybilQuery areapplicable to any traffic database

Preprocessing the Traffic Database An ideal implemen-tation of the endpoint generator would use an annotateddatabase of the local region to identify synthetic endpointsthat resemble the endpoints input by the user Theannotated database would contain descriptive tags for eachgeographic location such as ldquoparking lotrdquo ldquodowntownoffice buildingrdquo or ldquofreewayrdquo The endpoint generatorwould output synthetic sources and destinations whose tagsmatch the corresponding tags of the source and destinationsupplied by the user However such annotated databasesare laborious to create and even so are unlikely to becomprehensive or contain tags suitable for LBSs in differentapplication domains

The preprocessing step addresses this problem by automati-cally computing a feature set that describes each geographiclocation using information from the traffic database Theseautomatically extracted features then serve as tags that theendpoint generator can use to find synthetic sources anddestinations that statistically resemble the userrsquos input

Our implementation uses traffic density as the feature thatcharacterizes each geographic location We define the trafficdensity τ` of a geographic location ` as the number of tripsthat start end or traverse through ` in a fixed interval of time1Consequently our experiments also focus on trips in thisgeographic region However SybilQuery can automaticallygenerate Sybil queries for any geographic region if provided with asimilar traffic database for that region

(we describe in detail below our how our system representsgeographic locations for simplicity it suffices to think ofthem as fixed-size rectangular regions)We calculated τ`for each location ` using the PostGIS spatial extension ofPostgreSQL Because τ` can acquire a large number ofdiscrete values we used a simple clustering algorithm togroup ldquosimilarrdquo values of τ` into a single cluster Althoughany clustering technique may be used we found that evensimple clustering algorithms such as separating values of τ`into buckets serve our purpose well We empirically verifiedthat these clusters of the San Francisco Bay area sharesimilar semantic properties For example different shoppingcenters were grouped together into the same cluster as wereresidential areas

In addition to computing the traffic density of each geo-graphic location ` the preprocessing step also computes aprobability distribution function (PDF) π` of the length ofa trip that originates at ` In particular it computes thefollowing quantity for each location ` for all values of len

π`(len) = trips of length len that start in `

trips that start in `

This PDF is used by the endpoint generation algorithm tochoose appropriate geographic locations as sources for Sybilpaths

Temporal traffic patterns Traffic densities at a geographiclocation ` vary depending on the hour of the day and the dayof the week which results in temporal patterns in the valuesof τ` To reflect these temporal patterns in the features ofeach geographic location we define six kinds of temporalstates based on the patterns in our dataset namely ldquopeakintervalweekdayrdquo (7pm-11pm) ldquonormal intervalweekdayrdquo(6am-7pm and 11pm-3am) ldquooff-peak intervalweekdayrdquo(3am-6am) ldquopeak intervalweekendrdquo (7am-11pm) ldquonormalintervalweekendrdquo (6am-7pm and 11pm-3am) and ldquooff-peakintervalweekendrdquo (3am-6am) The precomputation stepcomputes six values of τ` for each location correspondingto each of the temporal states above When the user suppliesa source and a destination the endpoint generator choosesthe value of τ` to compute synthetic endpoints based uponthe timestamp in the userrsquos input

Representing geographic locations To compute τ` thepreprocessing algorithm needs an appropriate data structureto represent the location ` This data structure mustsimultaneously balance precision and scalability ie itmust store precise values of τ` for each location ` andmust readily scale to large geographic regions such as theSan Francisco Bay area To better illustrate this tradeoffsuppose that geographic locations are represented usingfixed-size rectangular blocks as was the case in an earlyimplementation of our prototype We found that a block sizeof 400mtimes400m resulted in a manageable number of blocksfor the entire Bay area (about 100000 blocks) but did notaccurately represent traffic densities in crowded areas suchas the downtown region and the airport On the other handa block size of 25mtimes25m accurately represented trafficdensities but resulted in a large number of blocks (about64 million blocks)

(a) San Francisco Bay areaThe airport appears on thelower right corner of thisfigure

(b) San Francisco airportBlack blocks have high traf-fic densities

Figure 2 Quadtree rep-resentations of geographiclocations

SybilQuery therefore usesan adaptive data structurethe Quadtree [12] to repre-sent traffic densities Thisdata structure represents ge-ographic locations using blocksof varying sizes (smallestsize is 25mtimes25m) depend-ing on the traffic density ofthat location In particu-lar geographic regions withnon-uniform distribution oftraffic are represented usingsmall block sizes while re-gions with a uniform dis-tribution of traffic are rep-resented using larger blocksizes Using the Quadtreedata structure we were ableto represent traffic densitiesfor the entire San FranciscoBay area using just 16000blocks

Output of preprocessingFigure 2 pictorially repre-sents the output of the pre-processing step Figure 2(a) shows the quadtree represen-tation of the entire San Francisco Bay area The adaptivenature of the Quadtree data structure results in differentblock sizes for different geographic locations Althougheach of these blocks has uniform traffic density in ourexperience areas with high traffic density tend to berepresented with smaller block sizes This effect is mostpronounced in areas with dense traffic such as the airportdepicted in Figure 2(b) Note that the regions with highesttraffic density such as the freeway and the pickup area infront of the airport are represented using small blocks whileareas inside the airport that have no traffic are representedusing larger block sizes Not all blocks have the sametraffic density in Figure 2(b) blocks with the highest trafficdensities are filled in black

Generating Sybil Endpoints When provided with a sourceand a destination input by the user the endpoint generatorcomputes a Sybil source destination pair as shown inAlgorithm 1 (our prototype adapts the same algorithm togenerate k minus 1 Sybil source destination pairs for brevitywe only illustrate the algorithm for one pair) It first finds thegeographic locations (in the Quadtree representation) `srcand `dst that contain the source and destination addressesinput by the user Next it computes the set of all sourcelocations ` that satisfy two conditions (1) the traffic densityof ` is approximately that of `src and (2) the probability ofa trip of length dist originating from ` matches that of `srcThe key intuition behind this step is to find the set of sourcelocations that closely resemble the source location input bythe user It randomly chooses a source location `srcprime fromthis set of locations and finds a destination location `dstprime

within a radius dist of `srcprime whose traffic density matchesthat of `dst The algorithm generates end points that are ofsimilar lengths but not exactly the same length as the real

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 4: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

shortest path) to make an appropriate choice from one of theavailable paths to the destination(b) Handling active adversaries An actively adversarialLBS may return doctored query responses as an attemptto differentiate Sybil paths from a clientrsquos real path Forexample suppose that an adversarial LBS falsely reportstraffic congestion at a particular location in response to a userquery In response a real user will likely take a detour If theSybilQuery system does not respond similarly to doctoredresponses (eg by checking for congestion and triggeringdetours in Sybil paths) an adversary could likely distinguishthe real path from Sybil pathsSybilQuery handles active adversaries usingN -variant queriesIn this technique SybilQuery queries multiple LBSs andcompares their responses in a manner akin to N -variantsystems [6] Unless all the LBSs queried collude with eachother adversarial LBSs are likely to be detected(c) Endpoint caching Suppose that a real path P fre-quented by the user (eg commuter paths such as hometo office) is associated with multiple sets of Sybil pathsAn LBS that observes paths over a period of time canstatistically identify P as the real path Similarly when auser travels to a new location shortly after arriving at thefirst destination if the second set of Sybil trips are randomlygenerated then the real paths can be easily distinguishedfrom the Sybil paths since the real paths share an endpointSybilQuery handles the above attacks by performing threetypes of caching (1) for the most common trips by the userthe Sybil endpoints are cached (2) if the user makes multipletrips from one common endpoint (eg his home or office)the corresponding Sybil endpoints are cached and (3) whenthe user embarks on a multi-destination trip the start pointsof the Sybil trips are cached so that the endpoint of a trip isthe same as the startpoint of the following trip(d) Providing path continuity Although the paths gen-erated by SybilQuery statistically resemble each other itis possible for the trip durations to differ (for both realand Sybil trips) If a real trip ends before some of theSybil trips end and the system stops sending queriesthe LBS can differentiate the real path from Sybil pathsSybilQuery guards against this by being an ldquoalways onrdquo toolthat continues to simulate movement along Sybil paths evenwhen the userrsquos real trip is complete This feature ensuressecurity even when the lengthsdurations of Sybil trips donot match those of real trips(e) Adding GPS sensor noise The location returned byGPS devices is typically inaccurate with the errors follow-ing a Gaussian distribution [7] If the Sybil locations alwaysfall on a road segment an adversary can differentiate themfrom locations returned by real GPS devices SybilQuerycan prevent this by adding a random noise to each Sybillocation

THE SybilQuery PROTOTYPEIn this section we describe the implementation of ourSybilQuery prototype The bulk of this section is devotedto the description of the endpoint generator which is imple-mented as a Python client that uses a PostgreSQL databasebackend with PostGIS spatial extensions to store regionaltraffic information The path generator is implemented asa client that queries an off-the-shelf service [35] to find the

set of waypoints corresponding to the shortest path givena startpoint and an endpoint Finally the query generatorsimulates user movement along all the k paths

The Endpoint GeneratorAs discussed in the previous section the endpoint generatoruses the source and destination addresses input by the userto generate k minus 1 synthetic endpoints To ensure thatthese synthetic endpoints statistically resemble the actualendpoints of the userrsquos trip the endpoint generator usesa regional traffic database The endpoint generator firstpreprocesses the database to identify key features of eachgeographic location

Regional Traffic Database For our prototype we useda month-long (August 25 2008ndashSeptember 24 2008) setof GPS traces from the Cabspotting project [2] as regionaltraffic database The Cabspotting project tracks the mobilityof cabs in the San Francisco Bay area1 Each cab thatparticipated in this project was outfit with a GPS device thatupdated its location with a server each minute Each of theseupdates was a quadruplelttimestamp cab ID current location flaggtHere flag indicates whether the cab was metered (ie ina trip) or empty We used flag to convert the GPS tracesinto trips (ie to demarcate sources and destinations duringwhich the cab was metered) Overall our database containsa total of 529533 trips by 530 unique cabs we observed atleast 447 cabs on any given day Although we tailor ourdiscussion in the rest of this section to this database weemphasize that the techniques employed by SybilQuery areapplicable to any traffic database

Preprocessing the Traffic Database An ideal implemen-tation of the endpoint generator would use an annotateddatabase of the local region to identify synthetic endpointsthat resemble the endpoints input by the user Theannotated database would contain descriptive tags for eachgeographic location such as ldquoparking lotrdquo ldquodowntownoffice buildingrdquo or ldquofreewayrdquo The endpoint generatorwould output synthetic sources and destinations whose tagsmatch the corresponding tags of the source and destinationsupplied by the user However such annotated databasesare laborious to create and even so are unlikely to becomprehensive or contain tags suitable for LBSs in differentapplication domains

The preprocessing step addresses this problem by automati-cally computing a feature set that describes each geographiclocation using information from the traffic database Theseautomatically extracted features then serve as tags that theendpoint generator can use to find synthetic sources anddestinations that statistically resemble the userrsquos input

Our implementation uses traffic density as the feature thatcharacterizes each geographic location We define the trafficdensity τ` of a geographic location ` as the number of tripsthat start end or traverse through ` in a fixed interval of time1Consequently our experiments also focus on trips in thisgeographic region However SybilQuery can automaticallygenerate Sybil queries for any geographic region if provided with asimilar traffic database for that region

(we describe in detail below our how our system representsgeographic locations for simplicity it suffices to think ofthem as fixed-size rectangular regions)We calculated τ`for each location ` using the PostGIS spatial extension ofPostgreSQL Because τ` can acquire a large number ofdiscrete values we used a simple clustering algorithm togroup ldquosimilarrdquo values of τ` into a single cluster Althoughany clustering technique may be used we found that evensimple clustering algorithms such as separating values of τ`into buckets serve our purpose well We empirically verifiedthat these clusters of the San Francisco Bay area sharesimilar semantic properties For example different shoppingcenters were grouped together into the same cluster as wereresidential areas

In addition to computing the traffic density of each geo-graphic location ` the preprocessing step also computes aprobability distribution function (PDF) π` of the length ofa trip that originates at ` In particular it computes thefollowing quantity for each location ` for all values of len

π`(len) = trips of length len that start in `

trips that start in `

This PDF is used by the endpoint generation algorithm tochoose appropriate geographic locations as sources for Sybilpaths

Temporal traffic patterns Traffic densities at a geographiclocation ` vary depending on the hour of the day and the dayof the week which results in temporal patterns in the valuesof τ` To reflect these temporal patterns in the features ofeach geographic location we define six kinds of temporalstates based on the patterns in our dataset namely ldquopeakintervalweekdayrdquo (7pm-11pm) ldquonormal intervalweekdayrdquo(6am-7pm and 11pm-3am) ldquooff-peak intervalweekdayrdquo(3am-6am) ldquopeak intervalweekendrdquo (7am-11pm) ldquonormalintervalweekendrdquo (6am-7pm and 11pm-3am) and ldquooff-peakintervalweekendrdquo (3am-6am) The precomputation stepcomputes six values of τ` for each location correspondingto each of the temporal states above When the user suppliesa source and a destination the endpoint generator choosesthe value of τ` to compute synthetic endpoints based uponthe timestamp in the userrsquos input

Representing geographic locations To compute τ` thepreprocessing algorithm needs an appropriate data structureto represent the location ` This data structure mustsimultaneously balance precision and scalability ie itmust store precise values of τ` for each location ` andmust readily scale to large geographic regions such as theSan Francisco Bay area To better illustrate this tradeoffsuppose that geographic locations are represented usingfixed-size rectangular blocks as was the case in an earlyimplementation of our prototype We found that a block sizeof 400mtimes400m resulted in a manageable number of blocksfor the entire Bay area (about 100000 blocks) but did notaccurately represent traffic densities in crowded areas suchas the downtown region and the airport On the other handa block size of 25mtimes25m accurately represented trafficdensities but resulted in a large number of blocks (about64 million blocks)

(a) San Francisco Bay areaThe airport appears on thelower right corner of thisfigure

(b) San Francisco airportBlack blocks have high traf-fic densities

Figure 2 Quadtree rep-resentations of geographiclocations

SybilQuery therefore usesan adaptive data structurethe Quadtree [12] to repre-sent traffic densities Thisdata structure represents ge-ographic locations using blocksof varying sizes (smallestsize is 25mtimes25m) depend-ing on the traffic density ofthat location In particu-lar geographic regions withnon-uniform distribution oftraffic are represented usingsmall block sizes while re-gions with a uniform dis-tribution of traffic are rep-resented using larger blocksizes Using the Quadtreedata structure we were ableto represent traffic densitiesfor the entire San FranciscoBay area using just 16000blocks

Output of preprocessingFigure 2 pictorially repre-sents the output of the pre-processing step Figure 2(a) shows the quadtree represen-tation of the entire San Francisco Bay area The adaptivenature of the Quadtree data structure results in differentblock sizes for different geographic locations Althougheach of these blocks has uniform traffic density in ourexperience areas with high traffic density tend to berepresented with smaller block sizes This effect is mostpronounced in areas with dense traffic such as the airportdepicted in Figure 2(b) Note that the regions with highesttraffic density such as the freeway and the pickup area infront of the airport are represented using small blocks whileareas inside the airport that have no traffic are representedusing larger block sizes Not all blocks have the sametraffic density in Figure 2(b) blocks with the highest trafficdensities are filled in black

Generating Sybil Endpoints When provided with a sourceand a destination input by the user the endpoint generatorcomputes a Sybil source destination pair as shown inAlgorithm 1 (our prototype adapts the same algorithm togenerate k minus 1 Sybil source destination pairs for brevitywe only illustrate the algorithm for one pair) It first finds thegeographic locations (in the Quadtree representation) `srcand `dst that contain the source and destination addressesinput by the user Next it computes the set of all sourcelocations ` that satisfy two conditions (1) the traffic densityof ` is approximately that of `src and (2) the probability ofa trip of length dist originating from ` matches that of `srcThe key intuition behind this step is to find the set of sourcelocations that closely resemble the source location input bythe user It randomly chooses a source location `srcprime fromthis set of locations and finds a destination location `dstprime

within a radius dist of `srcprime whose traffic density matchesthat of `dst The algorithm generates end points that are ofsimilar lengths but not exactly the same length as the real

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 5: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

(we describe in detail below our how our system representsgeographic locations for simplicity it suffices to think ofthem as fixed-size rectangular regions)We calculated τ`for each location ` using the PostGIS spatial extension ofPostgreSQL Because τ` can acquire a large number ofdiscrete values we used a simple clustering algorithm togroup ldquosimilarrdquo values of τ` into a single cluster Althoughany clustering technique may be used we found that evensimple clustering algorithms such as separating values of τ`into buckets serve our purpose well We empirically verifiedthat these clusters of the San Francisco Bay area sharesimilar semantic properties For example different shoppingcenters were grouped together into the same cluster as wereresidential areas

In addition to computing the traffic density of each geo-graphic location ` the preprocessing step also computes aprobability distribution function (PDF) π` of the length ofa trip that originates at ` In particular it computes thefollowing quantity for each location ` for all values of len

π`(len) = trips of length len that start in `

trips that start in `

This PDF is used by the endpoint generation algorithm tochoose appropriate geographic locations as sources for Sybilpaths

Temporal traffic patterns Traffic densities at a geographiclocation ` vary depending on the hour of the day and the dayof the week which results in temporal patterns in the valuesof τ` To reflect these temporal patterns in the features ofeach geographic location we define six kinds of temporalstates based on the patterns in our dataset namely ldquopeakintervalweekdayrdquo (7pm-11pm) ldquonormal intervalweekdayrdquo(6am-7pm and 11pm-3am) ldquooff-peak intervalweekdayrdquo(3am-6am) ldquopeak intervalweekendrdquo (7am-11pm) ldquonormalintervalweekendrdquo (6am-7pm and 11pm-3am) and ldquooff-peakintervalweekendrdquo (3am-6am) The precomputation stepcomputes six values of τ` for each location correspondingto each of the temporal states above When the user suppliesa source and a destination the endpoint generator choosesthe value of τ` to compute synthetic endpoints based uponthe timestamp in the userrsquos input

Representing geographic locations To compute τ` thepreprocessing algorithm needs an appropriate data structureto represent the location ` This data structure mustsimultaneously balance precision and scalability ie itmust store precise values of τ` for each location ` andmust readily scale to large geographic regions such as theSan Francisco Bay area To better illustrate this tradeoffsuppose that geographic locations are represented usingfixed-size rectangular blocks as was the case in an earlyimplementation of our prototype We found that a block sizeof 400mtimes400m resulted in a manageable number of blocksfor the entire Bay area (about 100000 blocks) but did notaccurately represent traffic densities in crowded areas suchas the downtown region and the airport On the other handa block size of 25mtimes25m accurately represented trafficdensities but resulted in a large number of blocks (about64 million blocks)

(a) San Francisco Bay areaThe airport appears on thelower right corner of thisfigure

(b) San Francisco airportBlack blocks have high traf-fic densities

Figure 2 Quadtree rep-resentations of geographiclocations

SybilQuery therefore usesan adaptive data structurethe Quadtree [12] to repre-sent traffic densities Thisdata structure represents ge-ographic locations using blocksof varying sizes (smallestsize is 25mtimes25m) depend-ing on the traffic density ofthat location In particu-lar geographic regions withnon-uniform distribution oftraffic are represented usingsmall block sizes while re-gions with a uniform dis-tribution of traffic are rep-resented using larger blocksizes Using the Quadtreedata structure we were ableto represent traffic densitiesfor the entire San FranciscoBay area using just 16000blocks

Output of preprocessingFigure 2 pictorially repre-sents the output of the pre-processing step Figure 2(a) shows the quadtree represen-tation of the entire San Francisco Bay area The adaptivenature of the Quadtree data structure results in differentblock sizes for different geographic locations Althougheach of these blocks has uniform traffic density in ourexperience areas with high traffic density tend to berepresented with smaller block sizes This effect is mostpronounced in areas with dense traffic such as the airportdepicted in Figure 2(b) Note that the regions with highesttraffic density such as the freeway and the pickup area infront of the airport are represented using small blocks whileareas inside the airport that have no traffic are representedusing larger block sizes Not all blocks have the sametraffic density in Figure 2(b) blocks with the highest trafficdensities are filled in black

Generating Sybil Endpoints When provided with a sourceand a destination input by the user the endpoint generatorcomputes a Sybil source destination pair as shown inAlgorithm 1 (our prototype adapts the same algorithm togenerate k minus 1 Sybil source destination pairs for brevitywe only illustrate the algorithm for one pair) It first finds thegeographic locations (in the Quadtree representation) `srcand `dst that contain the source and destination addressesinput by the user Next it computes the set of all sourcelocations ` that satisfy two conditions (1) the traffic densityof ` is approximately that of `src and (2) the probability ofa trip of length dist originating from ` matches that of `srcThe key intuition behind this step is to find the set of sourcelocations that closely resemble the source location input bythe user It randomly chooses a source location `srcprime fromthis set of locations and finds a destination location `dstprime

within a radius dist of `srcprime whose traffic density matchesthat of `dst The algorithm generates end points that are ofsimilar lengths but not exactly the same length as the real

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 6: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

Algorithm Generate Sybil Endpoints(src dst)Input (a) src Street address of userrsquos source

(b) dst Street address of userrsquos destinationOutput (srcprime dstprime) A Sybil sourcedestination pairdist = Euclidean distance between src and dst1`src `dst = geographic locations that contain src dst2Sources = Set of all locations ` with (i) τ` asymp τ`src and (ii) π`(dist)asymp π`src (dist)3`srcprime = random location isin Sources4Destinations = Set of all locations ` with (i) τ` asymp τ`dst and (ii) Euclidean distance5between `srcprime and `asymp dist`dstprime = random location isin Destinations6srcprime dstprime = Reverse geocode random points in `srcprime `dstprime 7return (srcprime dstprime)8

Algorithm 1 Generating Sybil endpoints

client path This is to allow for the client path or the Sybilpaths to make occasional detours in the path generation step

(a) Random point in geo-graphic location

(b) Street address closest torandom point via reversegeocoding

Figure 3 Finding a pointin a block using reversegeocoding

The last step is to identifyactual addresses within theSybil sourcedestination lo-cations just identified Todo so it randomly chooses apoint within the geographiclocation However thispoint may reside in non-driveable terrain as shownin Figure 3(a) If thispoint were returned as asourcedestination an adver-sary can easily identify thispoint as a Sybil address Toavoid such cases we reversegeocode this point (usingan API provided by GoogleMaps [18]) to find the streetaddress closest to the pointidentified Doing so greatlyimproves the quality of theendpoints generated by ouralgorithm Figure 3(b)shows the closest street ad-dress of the point identifiedin Figure 3(a)

The Path GeneratorThe SybilQuery prototype implements the path generatorusing the Microsoft Multimap API [35] When supplied witha source and a destination this API produces a sequenceof waypoints representing the path to the destination Asnoted earlier SybilQuery admits the use of any off-the-shelfpath generator such as those used by on-board navigationsystems This feature helps SybilQuery produce paths thatautomatically account for environmental factors such asroad closures and one ways

The Multimap API currently only produces the shortest pathto the destination thus making our prototype vulnerable toattack if the user follows a longer path to the destinationHowever as noted earlier this problem can easily be fixedwith the use of an API that generates a choice of paths tothe destination (or an API that allows the user to specify awaypoint that must be included enroute the destination) The

(a) Real trip The valuesof τ` for the source anddestination were 22 and 247respectively

(b) One of the Sybiltrips The values of τ`

for the source and des-tination were 22 and255 respectively

Figure 4 A real user trip and a Sybil trip generated bySybilQuery

path generation algorithm could then choose paths (both realand Sybil paths) based upon a probability distribution of thelength of path that users typically follow to their destination

The Query GeneratorThe basic version of SybilQueryrsquos query generator simulatesuser movement along each path It simulates the movementof users along Sybil paths at approximately the projectedspeed along that path (Information about the average speedalong a path is obtained is returned by most off-the-shelfpath generators) When the user sends a query to theLBS the query generator obtains the current location of thesimulated users and sends these as Sybil queries to the LBSWe have also interfaced the query generator with the Yahoolocal API [44] to more accurately simulate movement alongSybil paths under the constraints of current traffic Thus ifthere is traffic congestion at a particular location SybilQuerycan either simulate slower movement in that location orsimulate a detour (and request the path generator to generatea fresh path to the destination) Note that Sybil paths getcongested and decongested independently from real pathsWhen querying an LBS for online information the queryis made for all the Sybil locations as well as the real clientlocation so that the uncertainty at the LBS about the clientlocation remains the same Although querying an LBSthat reports current traffic conditions renders the prototypevulnerable to malicious LBSs that report false traffic datathe query generator can use the N -variant queries approachdescribed earlier to counter this threat

ExampleFigure 4 presents SybilQuery in action It shows a realuser trip (Figure 4(a)) obtained from the Cabspotter tracesfor another month and one of the Sybil trips generated bySybilQuery (Figure 4(b)) k was set to 4 for this exampleThe real trip (at approximately 6pm) originates at a homein Daly City and ends at a shopping area in downtown SanFrancisco All the Sybil trips generated by SybilQuery alsostarted in residential areas with similar traffic densities asthe source of the real trip and ended in a crowded region ofSan Francisco downtown Figure 4 also reports the trafficdensities for the sources and destinations for the real andSybil trip generated by SybilQuery A key point to notefrom this example is that traffic densities accurately capture

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 7: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

k questions correct Probability4 75 20 0266 75 14 019

Figure 5 Results of a user study with 15 volunteers

semantic properties (eg ldquoresidential areardquo ldquodowntownrdquoldquoshopping areardquo) of geographic regions

EVALUATIONIn this section we present the results of our evaluation ofthe SybilQuery prototype We evaluated both the privacyand the performance offered by SybilQuery

PrivacyTo evaluate the quality of the Sybil paths generated by oursystem we conducted a user study The high-level ideabehind the user study was to give the working system toadversarial users who would try to break the system bytrying to find the real user path hidden between Sybil pathsOur methodology was as followsmdashwe picked real pathsat random from the Cabspotter traces (different from themonth-long traces that were used to seed our prototypersquosendpoint generator) and used SybilQuery to generate Sybilpaths with different values of k We also performed aquantitative evaluation of privacy by devising a new metricThe intuition behind this metric is that an adversary who hasaccess to a database R of real paths and a database S ofSybil paths corresponding to paths in R should be unableto differentiate between the two databases We presentthe results from this evaluation in our detailed technicalreport [40]

User Study We conducted the user study with 15volunteers each of whom had to answer ten questions Eachquestion presented k paths (one real path and k minus 1 Sybilpaths) to the volunteer who had to identify the real pathfrom the Sybil paths Five questions had k = 4 while theremaining five had k = 6 Each volunteer was presentedwith a different set of ten questions ie our study obtainedresponses for 150 questions in all Volunteers could viewpaths using the Google Maps API and were instructed touse any tools at their disposal such as zooming into the mapstreet views and local information about the San FranciscoBay area in their attempt to distinguish real paths from Sybilpaths All our volunteers were computer science graduatesfamiliar with the basic geography of the San Francisco Bayarea

Figure 5 presents the results of this study and shows thenumber of questions for which users were able to correctlyidentify the real path from the Sybil paths As this figureindicates volunteers were able to correctly guess the realpath with a probability of 026 for k = 4 and 019 fork = 6 These probabilities are close to their expectedvalues (025 and 017 respectively) We further calculatedthe per-user standard deviation of the probability of correctguesses and found these values to be quite low (019 and015 respectively) These results lead us to conclude thatthe Sybil paths qualitatively resemble real paths

Figure 6 Queryresponselatency of SybilQuery

Figure 7 Spatial cloakingIncrease of cloaked regionradius with trip length

Sample responses from the user study gave us an interestingperspective into the mind of an adversary trying to breakthe SybilQuery system Several users selected a false tripas the ldquoodd man outrdquo since all the other trips (includingthe real trip) were similar Different users used contrastingrationale to guess real user trips For example some userswrongly selected shortest paths since the real path ldquolookedtoo circuitous to be truerdquo Yet others wrongly selectedcircuitous paths as they felt that ldquooutside human knowledgecould be directing this routerdquo Users typically found it hardto guess when the trip started late (ieif the first querywas made after travelling some distance) since ldquothe startingpoint seemed a bit oddrdquo Sybil paths with a prominentendpoint such as a University or a Metro station were oftenselected by users with responses such as ldquodestination isUniversity (different from others) and more likely to be thereal userrsquos queryrdquo

To summarize the user study the reasonings used by theadversaries were extremely diverse which seems to explainwhy the overall probability of guessing correctly was closeto that of random guessing

PerformanceWe report the performance of several aspects of SybilQueryincluding the time needed to preprocess the traffic databaseand the real-time performance of querying an LBS Wealso compare the performance of SybilQuery against animplementation of spatial cloaking All the results presentedbelow were obtained on a 173GHz Pentium M laptop with512 MB RAM Each experiment was repeated 50 times andthe value of the privacy parameter k was fixed to be 4 unlessotherwise indicated

One-Time and Once-Per-Trip Costs The one-time cost(offline step) for SybilQuery involves preprocessing of thetraffic database This step processed 529533 trips and tookapproximately 2 hours and 16 minutes

The once-per-trip costs for SybilQuery involve end-pointgeneration and path generation Generating a single pair ofendpoints at the beginning of a user trip took an average of547 seconds with a standard deviation of 102 seconds Tomeasure the cost of generating paths we randomly chosetrips with lengths varying between 5 and 30 minutes Themean cost of computing a path (including the networklatency to query the Microsoft MultiMap API) was 360milliseconds with a standard deviation of 150 milliseconds

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 8: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

Figure 8 Response size returned by Sybil-Query and spatial cloaking for mediumPOI density

Figure 9 Response size returned by spa-tial cloaking for different POI densities

Figure 10 Comparing the performance ofSybilQuery with spatial cloaking for rangequeries

QueryResponse Performance We measured the responselatency of SybilQuery by integrating it with the YahooMaps local search API [43] Figure 6 shows the latency ofsending k queries and receiving responses As expected thecost increases linearly with k since k requests are sent to theLBS each time the client makes a query

Comparison with Spatial Cloaking Spatial cloaking [19]is a state of the art technique that achieves k-anonymityby using an anonymizer to send the LBS cloaked regionscontaining at least k users The LBS returns all pointsof interest (POIs) within the cloaked region that match thequery to the anonymizer Although spatial cloaking wasoriginally proposed for static users we considered a simplevariant for mobile users in which cloaked regions grow asusers move2

We conducted experiments to compare spatial cloaking withSybilQuery Because spatial cloaking uses an anonymizerto send cloaked regions containing k clients to the LBS wecan expect the cloaked regions to increase in size as clientsmove (because the cloaked region must contain the same setof clients to preserve k-anonymity) In addition becausean LBS returns query results for the cloaked region wecan expect that increased POI density will lead to increasedqueryresponse processing time at the anonymizer Ourexperiments reported below confirmed these expectations

Figure 7 presents the results of the experiment that showsthat the size of cloaked regions increases as clients moveWe randomly chose trips varying in duration from 1 minuteto 30 minutes from the Cabspotter database We then fixedk = 4 and selected the smallest cloaked region containingat least k minus 1 other clients As Figure 7 shows the size ofthe cloaked region increases with the duration of the trip

To understand how this increase translates to queryresponseprocessing overhead we studied the size of the response(in kilobytes) received from the LBS for a fixed nearestneighbor query ldquoreturn the Chinese restaurant closest tomy current locationrdquo In spatial cloaking the anonymizersends the entire cloaked region to the LBS and must process2Although we have not carefully analyzed the privacy offered bythis scheme it suffices to compare the queryresponse performanceof spatial cloaking with that of SybilQuery

responses from the LBS to identify the closest Chineserestaurant for the client issuing the query However becauseSybilQuery sends exact street addresses the LBS sends theclosest Chinese restaurant Figure 8 compares the sizes ofthe query response for spatial cloaking and SybilQuery Asthis figure shows the query response size for SybilQueryis nearly a constant (at approximately 2KB) it only variesonly with the value of k However the response size forspatial cloaking increases linearly This is because the sizeof the cloaked region increases with trip length which inturn directly translates to an increased number of POIs thatmatch the userrsquos query

We also conducted an experiment to study the effect ofincreased POI densities on queryresponse performanceWe chose three representative nearest neighbor queries thatrequired the LBS to return the closest ldquorestaurantrdquo ldquoChineserestaurantrdquo and ldquoHunan Chinese restaurantrdquo to the clientrsquoslocation For spatial cloaking these queries representrespectively high medium and low POI densities AsFigure 9 shows with spatial cloaking the size of the queryresults depends both on POI density and on the duration ofthe trip and rises to about 5MB within 30 minutes for highPOI densities With SybilQuery the query response sizeremains fixed at 2KB (because only the closest restaurantis returned)

The experiments above used nearest-neighbour queries forwhich the difference in query responses for spatial cloakingand SybilQuery are most pronounced In contrast rangequeries such as ldquoreturn the list of all Chinese restaurants inan x-mile radiusrdquo require the LBS to process the query overan entire geographic region If a client issuing range queriesuses SybilQuery to protect his location the LBS mustprocess k geographic regions each of x-mile radius On theother hand if the client uses spatial cloaking the LBS mustonly process one geographic region whose radius is x mileslarger than the cloaked region We can therefore expectspatial cloaking to outperform SybilQuery as x increases

Figure 10 shows that this is indeed the case This figureplots the query response size against increasing values ofx for both SybilQuery and spatial cloaking For thisexperiment we assumed a circular cloaked region with afixed radius 5 kilometers and used SybilQuery with k = 4For smaller values of x SybilQuery outperforms spatial

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 9: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

cloaking because the combined area of k regions of radius xis smaller than the cloaked region However as x increasesspatial cloaking outperforms SybilQuery

Nevertheless we note that the queryresponse performanceof SybilQuery can never be worse than k times the perfor-mance of spatial cloaking even for range queries

RELATED WORKbull Synthetic locations The idea of augmenting clientlocation with synthetically generated locations to achievelocation privacy was also recently proposed by Krumm [27]in a concurrently developed system Although our work issimilar in spirit there are a number of differences betweenthe two systems Krumm uses a probabilistic model ofdriving behavior which depends on having data about theroad network and ground cover SybilQuery on the otherhand uses statistical clustering techniques for end pointgeneration and an off the shelf path generation tool whichmakes it easily scale to large regions Using false tripsfor location privacy was earlier proposed by Kino [2120] Kinorsquos work focused on reducing the computationcost so they use random walk based models for false pathgeneration which can be easily reidentified

bull Cloaking schemes Originally introduced by Gruteserand Grunwald [19] cloaking aims to achieve k-anonymityby hiding client location from the LBS both in space andtime These systems achieve spatial anonymity by sendinga cloaked region containing at least k users to the LBSThey achieve temporal cloaking by delaying a query untilat least k other users have also issued queries This basicmodel has been extended in several ways For examplework by Duckham and Kulik [8] Gedik and Liu [13 14]Bamba et al [1] and Mokbel et al [34] allow users topersonalize their privacy requirements eg by letting themdecide thresholds for spatial and temporal resolution ofcloaked regions Similarly work by Bamba et al [1] extendsthe basic model to allow for `-diverse [32] queries Otherrelated approaches include achieving k-anonymity usingpath confusion [23 25] where the anonymizing systemattempts to confuse the adversary by crossing paths whereat least two users meet

A common theme in all the above systems is the need fora trusted third-party anonymizer that ensures k-anonymity(or `-diversity) Using a third-party anonymizer has twodisadvantages First the anonymizer is the central point offailure and presents scalability and performance bottlenecksFor example an adversary could cripple the system witha denial-of-service attack on the anonymizer Moreoverbecause the anonymizer has access to sensitive informationfrom clients a compromize of the anonymizer wouldresult in a privacy breach By offering a decentralizeddesign SybilQuery avoids these shortcomings Secondmost previous approaches only provide security guaranteesfor static snapshots and do not consider history of usermovement (with the exception of work by Chow andMokbel [4] which also uses an anonymizer) Therefore ifa user asks the same query from multiple cloaked regionsas she moves the LBS could compromise her privacy bycorrelating queries from these regions These attacks are

called correlation attacks [15] SybilQuery offers improvedprotection than cloaking schemes because it does not attemptto hide the actual path of the user from the LBS Instead itensures that the adversary is unable differentiate user pathsfrom Sybil paths

bull Peer-to-peer schemes Recent research has developedtechniques based on peer-to-peer techniques to eliminate theneed for an anonymizer [17 16 5] Also related althoughproposed as a mechanism for anonymous Web access isCrowds [37] in which a query originates from a ldquocrowdrdquoof k users and the adversary is unable to identify the sourceof the query These techniques have the advantage ofbeing decentralized in nature However they still rely onthe presence of at least k participating peers in contrastSybilQuery operates autonomously independent of otherquerying peers

bull Private Information Retrieval (PIR) Cryptographictechniques based on PIR [29] offer another alternative toeliminate anonymizers [15 22] Using PIR to protect clientlocation offers strong cryptographic guarantees on privacythat SybilQuery unfortunately cannot offer Howeverthe PIR-based scheme suffers from two drawbacks thatlimit its applicability First in spite of impressive recentadvances PIR remains computationally expensive [41]For example Ghinita et al [15] employ a PIR protocolthat imposes a cost of O(n) at the server in addition toclientserver communication costs of O(

radicn) Practically

this translated to about 30 to 60 seconds of server processingtime for each location query and communication of afew megabytes of data to process a single location-basedquery in their implementation In contrast SybilQueryis computationally cheap and generates queries with sub-second latency Second PIR-based schemes require codemodifications at both the client and the server to implementthe PIR protocol Thus in contrast to SybilQuery PIR-basedschemes cannot easily be applied to legacy systems

bull Other related research Zhong et al [45] recently pro-posed a distributed k-anonymity scheme for location privacythat makes use of an additive homomorphic cryptosystemThe main limitation of such a system is that it requireslocation brokers(cellular andor WiFi providers) to be mod-ified in order to implement their scheme SpaceTwist [31]is a recently proposed location privacy system that uses acloaking-like scheme to obfuscate the location of the clientfrom the LBS and uses an incremental algorithm to processnearest neighbor queries Like SybilQuery SpaceTwistis also decentralized and autonomous However unlikeSybilQuery it does not ensure k-anonymity does not handlemobile users and only supports nearest neighbor queries

The idea of introducing synthetic information to achieveprivacy has previously been explored for anonymous Webaccess [24 38] In these systems a user query to aWeb server (eg a search request) is hidden amongstsynthetically generated queries Similarly Chuffnes etalrecently developed SwarmScreen [3] a system whichprotects user privacy in P2P systems by adding additionalrandom connections that are statistically indistinguishablefrom natural ones

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES
Page 10: Privately Querying Location-based Services with SybilQuery › ~iftode › ubicomp09.pdf · Privately Querying Location-based Services with SybilQuery Pravin Shankar, Vinod Ganapathy,

Recently Machanavajjhala et al [33] developed rigoroustechniques based upon a variant of differential privacy [9] toadd synthetic information to released databases and appliedit to a database of traffic commuting patterns Adaptingthese techniques to enable real-time generation of syntheticqueries is an interesting direction for future work

SUMMARY AND FUTURE WORKSybilQuery is an efficient decentralized technique to hideuser location from LBSs Its modular design allowsSybilQuery to be deployed with off-the-shelf client devicesand without any changes to the LBS We implemented aprototype of SybilQuery and integrated it with LBSs suchas Google Maps Yahoo Maps and Microsoft Live MapsOur experimentsmdashboth a qualitative user study as well asa performance evaluationmdashshow that Sybil queries can beefficiently generated and they resemble real user queries Infuture work we plan to enhance the SybilQuery prototypeto generate Sybil queries that also achieve stronger privacyguarantees such as `-diversity [32] t-closeness [30] anddifferential privacy [9]

ACKNOWLEDGEMENTSWe thank our shepherd Alex Varshavsky Arati BaligaMohan Dhawan Stephen Smaldone Rebecca Wright andall the anonymous reviewers for valuable feedback on thispaper We also thank all the participants of our userstudy The work has been supported in part by the NationalScience Foundation under awards CNS-0520123 and CNS-0831268

REFERENCES1 B Bamba L Liu P Pesti and T Wang Supporting anonymous location queries

in mobile environments with PrivacyGrid In Proc WWW 2008

2 Cabspotting project httpcabspottingorg

3 D Choffnes J Duch D Malmgren R Guierma F Bustamante and L AmaralSwarmscreen Privacy through plausible deniability in p2p systems InNorthwestern EECS Technical Report 2009

4 C-Y Chow and M F Mokbel Enabling private continuous queries for revealeduser locations In Proc SSTDrsquo07 Advances in Spatial and Temporal Databases2007

5 C-Y Chow M F Mokbel and X Liu A peer-to-peer spatial cloaking algorithmfor anonymous location-based services In Proc GISrsquo06 ACM InternationalSymposium on Advances in Geographic Information Systems 2006

6 B Cox D Evans A Filipi J Rowanhill W Hu j Davidson J KnightA Nguyen-Tuong and J Hiser N-Variant systems A secretless framework forsecurity through diversity In Proc USENIX Security Symposium 2006

7 F Diggelen Gnss accuracy Lies damn lies and statistics GPS World 2007

8 M Duckham and L Kulik A formal model of obfuscation and negotiation forlocation privacy In Proc Pervasive pp 152-170 2005

9 C Dwork Differential privacy In Proc ICALPrsquo06 Intl Colloquium onAutomata Languages and Programming 2006

10 EZPass A pass on privacy httpwwwnytimescom20050717magazine17WWLNhtml

11 EZPass records out cheaters in divorce courthttpwwwmsnbcmsncomid20216302

12 R Finkel and J Bentley Quad trees A data structure for retrieval on compositekeys Acta Informatica 4(1)1ndash9 1974

13 B Gedik and L Liu Location privacy in mobile systems A personalizedanonymization model In Proc ICDCS 2005

14 B Gedik and L Liu Protecting location privacy with personalized k-anonymityArchitecture and algorithms IEEE Transactions on Mobile Computing 2007

15 G Ghinita P Kalnis A Khoshgozaran C Shahabi and K-L Tan Privatequeries in location-based services Anonymizers are not necessary In ProcACM SIGMOD 2008

16 G Ghinita P Kalnis and S Skiadopoulos Mobihide A mobile peer-to-peersystem for anonymous location-based queries In Proc SSTDrsquo07 InternationalSymposium on Spatial and Temporal Databases 2007

17 G Ghinita P Kalnis and S Skiadopoulos PRIVE Anonymous location-basedqueries in distributed mobile systems In Proc WWW 2007

18 Google maps geo API httpcodegooglecomapismaps

19 M Gruteser and D Grunwald Anonymous usage of location-based servicesthrough spatial and temporal cloaking In Proc Mobisys 2003

20 K H Y Yanagisawa and T Satoh An anonymous communication techniqueusing dummies for location-based services In Proc IEEE InternationalConference on Pervasive Services 2005

21 K H Y Yanagisawa and T Satoh Protection of location privacy using dummiesfor location-based services In Proc ICDE Workshops 2005

22 U Hengartner Hiding location information from location-based services InProc International Workshop on Privacy-Aware Location-based MobileServices (PALMS) 2007

23 B Hoh and M Gruteser Location privacy through path confusion In ProcSecureComm 2005

24 D Howe and H Nissenbaum TrackMeNot Resisting surveillance in Websearch On the Identity Trail Privacy Anonymity and Identity in a NetworkedSociety (Oxford University Press) 2008

25 R R C J Meyerowitz Realtime location privacy via mobility predictionCreating confusion at crossroads In Proc HotMobile 2009

26 P Kalnis G Ghinita K Mouratidis and D Papadias Preventing location-basedidentity inference in anonymous spatial queries IEEE Transactions onKnowledge and Data Engineering 19(12)1719ndash1733 2007

27 J Krumm Realistic driving trips for location privacy In Proc Pervasive 2009

28 J Krumm and E Horvitz Predestination Where do you want to go today InIEEE Computer Magazine vol 40 no 4 pp 105-107 2007

29 E Kushilevitz and R Ostrovsky Replication is not needed Single databasecomputationally-private information retrieval In Proc FOCSrsquo97 IEEESymposium on Foundations of Computer Science 1997

30 N Li T Li and S Venkatasubramanian t-closeness Privacy beyondk-anonymity and l-diversity In Proc ICDE 2007

31 M L Liu C S Jensen X Huang and H Lu Spacetwist Managing thetradeoffs among location privacy query performance and query accuracy inmobile services In Proc ICDE 2008

32 A Machanavajjhala J Gehrke D Kifer and M Venkitasubramaniam`-Diversity Privacy beyond k-anonymity In Proc ICDE 2006

33 A Machanavajjhala D Kifer J Abowd J Gehrke and L Vilhuber PrivacyTheory meets practice on the map In Proc ICDE 2008

34 M F Mokbel C-Y Chow and W G Aref The new Casper Query processingfor location services without compromising privacy In Proc VLDB 2006

35 Microsoft multimap API httpwwwmultimapcom

36 NYC cabs strike over GPS system planshttpwwwengadgetcom20070309nyc-cab-drivers-say-no-thanks-to-gps-installation

37 M K Reiter and A D Rubin Crowds Anonymity for Web transactions ACMTransactions on Information and System Security 1(1)66ndash92 1998

38 F Saint-Jean A Johnson D Boneh and J Feigenbaum Private Web search InProc WPESrsquo07 ACM Workshop on Privacy in the Electronic Society 2007

39 P Samarati Protecting respondents identities in microdata release IEEETransactions on Knowledge and Data Engineering 13(6)1010ndash1027 2001

40 P Shankar V Ganapathy and L Iftode Privately querying location-basedservices with sybilquery In Rutgers University Technical Report DCS-TR-6522009

41 R Sion and B Carbunar On the computational practicality of privateinformation retrieval In Proc NDSS 2007

42 L Sweeney k-anonymity A model for protecting privacy International Journalon Uncertainty Fuzziness and Knowledge-based Systems 10(5)557ndash570 2002

43 Yahoo maps local search API httpdeveloperyahoocommaps

44 Yahoo Local httptrafficyahoocomtraffic

45 G Zhong and U Hengartner A distributed k-anonymity protocol for locationprivacy In Proc IEEE PerCom 2009

  • Introduction
  • Problem Definition and Assumptions
  • Hiding Client Location using Sybil Queries
  • The SybilQuery Prototype
    • The Endpoint Generator
    • The Path Generator
    • The Query Generator
    • Example
      • Evaluation
        • Privacy
        • Performance
          • Related Work
          • Summary and Future Work
          • Acknowledgements
          • REFERENCES