
GraphProtector: A Visual Interface for Employing and Assessing Multiple Privacy Preserving Graph Algorithms

Xumeng Wang, Wei Chen, Jia-Kai Chou, Chris Bryan, Huihua Guan, Wenlong Chen, Rusheng Pan and Kwan-Liu Ma


Fig. 1. The visual privacy preservation stage of GraphProtector. (a) The graph protector view integrates multiple privacy-preserving schemes. (b) The provenance view illustrates the effect caused by each process. (c) The utility view lists user-selected utility metrics. (d) The priority view depicts the processing priority for nodes/identities specified by users.

Abstract—Analyzing social networks reveals the relationships between individuals and groups in the data. However, such analysis can also lead to privacy exposure (whether intentionally or inadvertently): leaking the real-world identity of ostensibly anonymous individuals. Most sanitization strategies modify the graph’s structure based on hypothesized tactics that an adversary would employ. While combining multiple anonymization schemes provides more comprehensive privacy protection, deciding the appropriate set of techniques—along with evaluating how applying the strategies will affect the utility of the anonymized results—remains a significant challenge. To address this problem, we introduce GraphProtector, a visual interface that guides a user through a privacy preservation pipeline. GraphProtector enables multiple privacy protection schemes which can be simultaneously combined together as a hybrid approach. To demonstrate the effectiveness of GraphProtector, we report several case studies and feedback collected from interviews with expert users in various scenarios.

Index Terms—Graph privacy; k–anonymity; structural features; privacy preservation

1 INTRODUCTION

• Xumeng Wang, Wei Chen, Wenlong Chen and Rusheng Pan are with Zhejiang University. E-mail: [email protected]; [email protected]; {chenwenlong, panrusheng}@zju.edu.cn. Wei Chen is the corresponding author.

• Jia-Kai Chou, Chris Bryan and Kwan-Liu Ma are with University of California, Davis. E-mail: {jkchou, cjbryan}@ucdavis.edu, [email protected].

• Huihua Guan is with Zhejiang University and Alibaba Group. E-mail: [email protected].

Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: [email protected]. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx

In today’s world, interpersonal information is regularly collected and modeled using social networks. Analyzing the structures of these networks provides insights into the relationships of individuals within populations. Among the social sciences, network research serves many purposes, such as detecting communities [52], identifying lead influencers [40], and examining the spread of information [35]. In corporate environments, network analysis provides support for business-critical operations [36], including targeted marketing [7, 20], crime and fraud detection [7, 16], and behavioral analysis of customers [26, 51].

Datasets built using individuals often contain sensitive information. By analyzing the network, the specific identities and attributes of ostensibly anonymous individuals can be (either intentionally or inadvertently) compromised [50]. We refer to this as privacy exposure or leaking. It is important to perform proper privacy preserving operations before the data can be shared or released.

Privacy preservation schemes are anonymization approaches that counteract hypothesized adversarial attack models. A common approach for network sanitization is to extend the well-known k–anonymity model [45]: assuming a closed world, in which no external dataset/information is considered known by the attackers, privacy is exposed when a localized structure, property, or group of individuals can be uniquely identified in a given network dataset.

Most anonymization models, including k–anonymity, focus on specific subsets of privacy specification. As a mechanism for sanitizing a dataset, this means they are only able to handle the corresponding type of attack(s). Considering network privacy in a more comprehensive manner by simultaneously combining multiple anonymization models is a trending research direction [27]. A significant challenge in incorporating multiple models is comparing the effectiveness of different sanitization schemes. There are two reasons such comparison is helpful: (1) it provides understanding of how privacy preservation is being accomplished as an event-driven pipeline of anonymization actions, and (2) it makes visible the change in utility when transitioning from a raw, leaking dataset to one that is processed and sanitized. Privacy requirements and utility measurements are highly dependent on dataset context and user semantics. Thus, providing flexibility and transparency in the privacy preservation process is a crucial but non-trivial task.

To help accommodate this challenge, we present GraphProtector, a novel system for preserving privacy in network datasets via the interactive synthesis and application of multiple anonymization models. Figure 2 shows the workflow for GraphProtector, which employs several linked views with data-driven backend heuristics in a multi-stage, user-in-the-loop sanitization pipeline.

GraphProtector allows a user to load a network dataset (in this paper, we focus on simple graphs) and then employ one or more privacy preserving mechanisms (which we refer to as protectors) to conduct targeted dataset anonymization. Protectors act based on a customizable set of identity priorities and utility metrics. That is, when a protector finds privacy leaks in the dataset, its removal scheme is based on a user-defined set of specifications. Protector recommendations can be customized and compared against each other, allowing for flexible implementation of dataset sanitization. A provenance component of GraphProtector provides a historical view of dataset change over the course of applying multiple protectors, allowing a user to see exactly how the dataset’s utility and privacy exposure are changing.

GraphProtector is designed around a set of task requirements derived from discussion with domain researchers. We present two case studies of real-world datasets that demonstrate how the system can be used. We additionally interview several domain experts in privacy and related research areas and discuss the collected feedback. Their comments shed light on how tools that accommodate visualization-based, user-in-the-loop workflows like GraphProtector can be integrated into their normal analysis routines. As an initial attempt at visually establishing complex, custom privacy protections for graph-based datasets, our results open up opportunities for expanded research in the domain, including open-world considerations, more complex anonymization models, and multiple privacy preserving operations.

2 RELATED WORK

Privacy Preservation for Graphs. Network datasets are commonly represented visually as node-link diagrams. For an overview of general graph visualization, see [34, 44, 47].

With the rapid development of data collection techniques, researchers now extend classical models for privacy preservation to graph data [56]. Based on differential privacy models, some studies choose to make appropriate perturbations to the graph structure [41, 49, 53]. For instance, Xiao et al. [53] abstracted a dendrogram from a simple graph according to the hierarchical random graph (HRG) model. In reconstructing the original graph based on the dendrogram, they applied differential privacy methods to add noise so that the privacy issues could be removed effectively.

Differential privacy approaches are vulnerable to many variants of de-anonymization attacks, while models based on k–anonymity may conditionally resist them [27]. K–anonymity [45] anonymizes individuals by making at least k individuals indistinguishable, where k is an adjustable, user-defined parameter: the higher the k, the stronger the privacy preservation. The original model, designed for tabular data, achieves anonymization by obscuring attribute values. Later studies regard structural features as a kind of attribute, and have extended this model to deal with privacy issues in graph data, e.g. k-degree [33], k-automorphism [57], and k-isomorphism [11]. These models anonymize structural features in the graph by constructing equivalent features; attackers are thus unable to identify which one is the target. Results generated by these models can resist some kinds of attacks, but not all.

In addition to differential and anonymity approaches, other models anonymize based on features of the graph itself. Ying and Wu [55] edit the graph by randomly adding, deleting, and switching edges. Thompson et al. [46] present two clustering-based algorithms, bounded t-means clustering and union-split, to achieve privacy preservation.

Privacy-Awareness for Other Visualization Techniques. For general visualization research that integrates privacy awareness (both sanitization and detection), common approaches fall into two categories: (1) preserving privacy in the visualization itself (similar to the aforementioned graph-based work), and (2) using visualization to facilitate a privacy preservation pipeline.

For the first approach, privacy sanitization is applied directly to the visualization, creating an updated chart that is sanitized. As an example, Dasgupta and Kosara [15] extend parallel coordinates by employing a screen-space sanitization technique: only an approximate interval of data values for polylines can be recognized, thus meeting a k–anonymity threshold. Likewise, for summary histograms, Archambault et al. [3] aggregate components. Considering user diversity, Oksanen et al. [39] visualize trajectory data using privacy preserving heatmaps.

In other instances, privacy preservation must be accomplished in a more complex or analytic-specific manner. Thus, interactive and iterative processing is necessary to conduct an action-based pipeline for updating the dataset and visualization. Visualization exists within a system both to analyze and review the state of privacy preservation. This approach has recently been used by Chou et al. to handle event sequence datasets [13] and ontological social network data [12]. These systems provide users with suggested obfuscations for sanitizing data based on syntactic anonymity models. Similarly, work by Wang et al. [48] allows a data owner to share sanitized data directly based on a desired level of privacy preservation.

Like these latter systems, GraphProtector embeds visualization as a way to provide exploration, understanding, and provenance into a custom sanitization pipeline.

Evaluating Privacy Preservation. No matter how sanitization is accomplished, its effectiveness can be evaluated mainly by measuring the resultant amount of privacy preservation and the change in visualization utility.

To verify the effect of privacy preservation, de-anonymization algorithms can be used to simulate attacks. For example, in k–anonymity approaches, the parameter k directly reflects the minimum amount of privacy preservation. For simple graphs (which we focus on in this paper), attacks usually identify sub-graphs and other structures within the network by querying for specific features [24, 28, 30, 37, 38]. The proportion of individual nodes that can be de-anonymized measures the amount of privacy preservation.

Unfortunately, privacy protection always comes at the cost of utility loss. For graphs, there are a variety of widely used ways to measure utility, including node degree, centrality, graph diameter, and average path length. These features can be represented in different forms: a value generalizes the status of the entire graph [42], a vector characterizes each individual node [6, 8, 18, 19], and a matrix reflects the relationships between each pair of nodes [23]. For more specialized scenarios, utility metrics can be directed at specific tasks, like community detection [54] and role extraction [25].


3 TASK REQUIREMENTS

We focus on simple graphs because they lay the foundation for complex graphs. Also called strict graphs, simple graphs are unweighted, undirected graphs that do not contain loops or multiple edges [21]. Attacks against simple graphs may target the attribute values of the individuals represented by nodes or the relation links indicated by the graph structure. In practice, the relationships in some graph data involve more information, leading to weighted, directed, or multi-edge graphs; we leave these as future work.

Our targeted users are those who work with sensitive network datasets, require flexible and personalized privacy preservation, and are knowledgeable about anonymization approaches for graphs. This means supporting customized solutions that provide sufficient protection while minimizing loss of overall dataset utility. Following the definition of Ji et al. [27], we define “utility” as the similarity of structural properties.

To drive our design and help define goals, we initially consulted over a series of meetings with a domain researcher who regularly works with network datasets and studies privacy preserving algorithms for graph data. Based on this, we define a set of four task requirements (TR) for building a visual privacy preserving system:

TR1: Learn about the characteristics of the graph. Making sense of a graph that contains hundreds or thousands of nodes and edges is challenging, even when leveraging advanced visualization techniques like optimized layouts and interaction. Augmenting the base node-link diagram with additional statistical views (node distributions by property, etc.) provides a better holistic overview of the dataset and its properties.

TR2: Set priority before applying privacy preserving algorithms. Before modifying the graph, users should have the flexibility to set priorities or constraints with respect to which nodes are “touched” first. While different sets of anonymization operations, i.e., modifying different parts of a graph, may fulfill the same privacy requirement, some of them can introduce a significant and unwanted change to the characteristics of the graph (or a specific set of nodes). Depending on a user’s needs, a region of the graph that meets certain structural criteria or contains important information should be kept as intact as possible.

TR3: Evaluate and compare schemes. There are many potential graph modifications that can fix a privacy leak, depending on which algorithm(s) is used, which nodes or edges are touched, and how they are updated. Researchers may choose one of several different schemes, using either a single privacy preserving algorithm or multiple ones. Being able to compare different schemes allows the user to determine the most appropriate action. This also includes understanding what level of protection each scheme provides, what steps the scheme takes, and what the utility cost of achieving the protection is.

TR4: Record the provenance of graph modifications. Privacy preserving operations modify the topology of a graph, introducing uncertainty and invariably degrading the graph’s resultant utility. It is important to allow users to keep track of the changes that have been made, allowing them to go back and try new schemes if the current result is deemed unsatisfactory.

4 PRIVACY PRESERVATION APPROACH AND WORKFLOW

We outline the privacy preservation approach employed in GraphProtector and describe at a high level its multi-stage, user-in-the-loop workflow (shown in Figure 2) to identify and sanitize leaks. In Section 5, we describe in detail how GraphProtector’s interfaces and interactions facilitate this workflow.

4.1 Structural Features that Expose Privacy

We consider three types of structural features—introduced by Hay et al. [24]—that can be queried to compromise the privacy of individuals in simple graphs: node degree, hub fingerprint, and subgraph. As technology develops, graph privacy may face additional threats; in this work (as a beginning study), we focus on these three basic issues.

Node degree. The degree (or valency) of a node is the number of edges connected to that node. Node degree distributions in networks are usually uneven and skewed, with a long tail of outliers. By submitting queries that return nodes with a specific degree, attackers may be able to identify individuals with a unique node degree value in the network.
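To make this concrete, the following is a minimal sketch of such a degree-based query (ours, not from the paper), assuming the networkx library: any degree value shared by fewer than k nodes marks those nodes as identifiable.

```python
# Illustrative sketch (not GraphProtector's code): find nodes whose degree is
# shared by fewer than k nodes, i.e. nodes a degree query can single out.
from collections import Counter
import networkx as nx

def degree_leaks(G: nx.Graph, k: int = 2):
    counts = Counter(d for _, d in G.degree())
    return [v for v, d in G.degree() if counts[d] < k]

G = nx.barabasi_albert_graph(200, 3, seed=1)  # skewed degrees, as in real networks
print(degree_leaks(G, k=2))                   # nodes exposed by a unique degree
```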

Hub Fingerprint. A hub refers to a node that contains extreme or unique feature values. Examples of hubs include nodes with high degree or nodes that exist on many shortest paths throughout the graph (i.e., they bridge clusters in the graph or are centrally located).

In many real-world social networks, hubs are high-profile individuals like celebrities or government officials. It is challenging to mask the identity of hubs since their “fingerprints” are anomalous compared to other nodes. Attackers can leverage a node’s connection to one or more hubs to expose its identity. This is known as a hub fingerprint attack.

Subgraph. A subgraph is a subset of connected nodes extracted from the network. A common attack involving subgraphs is to first embed a subgraph with a specifically designed structural connection to a target graph, then utilize that knowledge to initiate other privacy attacks [5]. If not enough instances of a “sensitive” subgraph exist in the network, it constitutes a privacy leak.

As a part of GraphProtector’s workflow, we design “protectors” (Figure 2(c)) for each of these privacy issues to identify leaks—see Section 5.2.1 for descriptions.

4.2 Defining a Privacy Preserving Model

When structural features in a graph are identified as exposing privacy, there are various ways to achieve anonymization. Common methods include adding/swapping nodes and edges (introducing uncertainty), merging nodes (generalization), and removing nodes and edges (suppression) [33].

In our case, we focus on protecting the identities of persons (nodes). To avoid exerting excessive effects on individuals by introducing “dummy” nodes, we only allow edges to be modified—specifically, edge addition. By considering only one type of operation, we reduce computational complexity and avoid potential operational conflicts. Compared to edge deletion, adding edges provides an additional benefit: the effect of deleting edges can only be uni-directional, i.e., it always removes information. In contrast, adding edges not only introduces previously nonexistent information, in some cases it also removes information. For example, adding edges to all the nodes that have a specific node degree can remove that degree information from the resultant graph. As potential future work, we note that combining multiple sanitization techniques can and should certainly be explored. However, this is a non-trivial task requiring careful design and evaluation.

Two approaches can be used to provide privacy protection. Given a set of privacy leaking nodes: (1) make structural changes to some of the non-privacy leaking nodes so that at least k sets of nodes obtain the same features, or (2) make structural changes to the set of privacy leaking nodes so that their structural features are equivalent to another set of nodes in the graph. GraphProtector supports both approaches. Before a protection action is invoked, the system checks to guarantee that no additional privacy issue is introduced upon making structural changes.

As an example of the first approach, assume 10–anonymity is required for a set of 7 privacy leaking nodes with degree = 9. To ensure protection, 3 non-privacy leaking nodes must be updated to also have degree = 9. Since protection is achieved by adding edges, nodes with lower degree have processing priority. To minimize the total necessary modifications, non-privacy leaking nodes with degree = 8 are first checked, followed by nodes with degree = 7, and so on. This process continues until enough candidate nodes are found and edges are added between them.

The second method adds edges directly to the set of privacy leaking nodes. Using the above scenario, we progressively add edges to the 7 privacy leaking nodes with degree = 9 (making their degrees 10, 11, or more) until the privacy requirement is met.


Fig. 2. The GraphProtector workflow: (a) A network dataset is loaded and its relationships and properties are visualized. (b) Desired node priority and utility metrics are specified to direct the subsequent anonymization actions. (c) One or more protectors are employed which modify the dataset’s topology to preserve privacy. (d) The updated, processed dataset with an accompanying processing report.

Since GraphProtector supports both schemes, the approach that requires the fewest edge additions is selected, as sketched below. The pseudocode of this process is provided in Appendix I.
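As a rough illustration of that selection (a simplified sketch under our own assumptions, not the authors' Appendix I pseudocode), the edge cost of each approach for a group of leaking nodes that share one degree can be estimated and compared:

```python
# Sketch: estimate the edge cost of both approaches for `leaking` nodes that
# share degree `deg` and need k-anonymity, then pick the cheaper scheme.
# All names are hypothetical; the real system also honors node priorities.
from collections import Counter
import networkx as nx

def choose_scheme(G: nx.Graph, leaking, deg: int, k: int):
    counts = Counter(d for _, d in G.degree())
    # Approach 1: promote the closest lower-degree candidates up to `deg`.
    gaps = sorted(deg - d for v, d in G.degree() if d < deg and v not in leaking)
    need = k - len(leaking)
    cost1 = sum(gaps[:need]) if len(gaps) >= need else float("inf")
    # Approach 2: raise the leaking nodes to the nearest degree value that
    # at least k nodes already share.
    step = next((s for s in range(1, len(G))
                 if counts.get(deg + s, 0) >= k), float("inf"))
    cost2 = step * len(leaking)
    return ("promote candidates", cost1) if cost1 <= cost2 else ("raise leaking", cost2)
```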

4.3 Workflow

At a high level, GraphProtector’s workflow links together backend algorithms for detecting, evaluating, and sanitizing privacy leaks in structural features via several linked, interactive views. Figure 2 shows the major steps and interactions in the workflow. We also denote the specific tasks from Section 3 that each step in the workflow accommodates.

Original Data. First, a network dataset is loaded into GraphProtector. The node-link diagram is shown at the beginning to help users select the metrics they need. Combined with the statistical distribution charts of selected metrics, users can quickly grasp the characteristics of the input network (TR1).

Visual Specification of Priority and Utility. Prior to applying privacy preserving algorithms, users can set processing priorities for the nodes in the network (TR2). Nodes can be grouped into bins, and these bins are manually ordered according to their processing priority: the order in which their nodes are considered as candidates for privacy preserving operations. Nodes in the same bin share the same priority. Setting a lower processing priority for a set of important nodes reduces the possibility that those nodes will be directly affected by topology modifications. Particularly significant nodes can even be locked to prevent modification.

For example, a user might wish to “prioritize the 30% of nodes with the lowest degree,” or “if possible, do not touch the 1% of nodes with the highest betweenness.” When more than one feasible operation can handle a leak (which is usually the case), decisions are made according to this user-defined priority. To guarantee that a solution exists, a user should not lock an excessive number of nodes, but instead define a detailed list of processing priorities.

In addition to priority, the user can specify one or more utility metrics to evaluate the effects of potential sanitization actions. “Number of added edges” is the default metric, but there are many ways to quantify topology modifications (see Section 5.1.3). Topology modifications induced by privacy preserving operations are expressed as vectors; the updated graph is compared to the prior one via a high-dimensional similarity calculation.

Visual Privacy Preservation. At this point, the user employs one or more protectors to sanitize the dataset. This process uses interactive decision-making based on automatic identification of leaking structural features (degree, hub fingerprint, subgraph) with application of our privacy preserving model to modify the graph’s topology and fix specific leaks.

Since each protector type applies to a different structural feature, the process of graph modification is decomposed into steps. A user may also employ more than one protector of each type, for example, sanitizing multiple subgraphs, or invoke them in any desired order.

Protectors follow a standard framework, shown in Figure 2(c). First, privacy risks are identified based on the user-defined k value(s). Then, the user customizes one or more schemes to guide the privacy preserving algorithm. Evaluating and comparing schemes allows the “best” one (according to the user’s semantics) to be picked (TR3). For example, an unrealistic scheme may force sequential processing that achieves the goal only at a high price: if the user requires that “all the nodes have the same degree,” the graph’s original degree distribution will be completely destroyed and all utility lost. Therefore, the user should select and execute a more targeted scheme that only updates the graph such that the affected structural feature no longer exposes privacy. As schemes are executed, the graph is updated to provide a provenance view of prior executed protectors (TR4).

Processed Data. Once the user is satisfied that the network is sanitized, the “final version” can be exported and reviewed (TR3). This includes the updated dataset (the node-link graph) as well as a processing report of changes.


Fig. 3. The interface for visual specification of priority and utility is composed of three views: (a) the node-link view, rendering the overview of the graph, (b) the utility view, and (c) the priority view for specifying node priority.

5 SYSTEM DESIGN

We now describe in detail the primary views and interactions in GraphProtector. The system is based around two interfaces (Figures 1 and 3) and five linked views. Throughout this section, we use a Facebook friendship dataset that contains 333 nodes and 2,519 edges [32].

5.1 Interface for Specification of Priority and Utility

The graph interface contains three views: a visualization of the loaded graph and two panels for specifying priority and utility metrics.

5.1.1 Graph View

The first view visualizes the network dataset using a node-link diagram and force-directed layout, as shown in Figure 3(a) (TR1). Nodes and edges in the graph are highlighted based on interactions and operations made during GraphProtector’s workflow. In addition, users can check the clustering results [22] when comparing the processed data and the original data.

5.1.2 Priority View

The priority view initializes by showing a list of derived features for nodes: degree, closeness, betweenness, eigenvector, and constraint. Selecting a set of these metrics creates a set of line charts showing the statistical distributions of nodes for each selection (TR1). For example, in Figure 3(b), the user has selected node degree, closeness, and betweenness centrality. Each chart displays the relationship between the metric’s values (y–axis) and the percentile rankings of the nodes (x–axis).

By brushing on the line charts, the user can specify a set of nodes (based on the intersection of brushes). This set can be created as a bin for establishing node priority (TR2), which is placed in the vertical stack to the right of the line charts. (By default, all nodes start out in the same box and the stack only contains this one box.) As new boxes are created and added to the stack, the user can drag and drop boxes to switch their priority ordering. Boxes below the blue “lock” line are locked—nodes in those boxes will not be changed by any privacy preservation operation.
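As an illustration of such a brush (a hypothetical helper assuming networkx, not GraphProtector's actual API), a percentile range over a node metric maps directly to a candidate bin:

```python
# Sketch: select nodes whose rank under a metric falls in [lo, hi] (0..1),
# mimicking the percentile brushes used to populate priority bins.
import networkx as nx

def brush_by_percentile(G, metric: dict, lo: float = 0.0, hi: float = 1.0):
    ranked = sorted(G.nodes, key=metric.__getitem__)
    n = len(ranked)
    return set(ranked[int(lo * n):int(hi * n)])

G = nx.karate_club_graph()
low_degree_bin = brush_by_percentile(G, dict(G.degree()), 0.0, 0.3)  # high priority
betw = nx.betweenness_centrality(G)
locked = brush_by_percentile(G, betw, 0.99, 1.0)  # "do not touch" candidates
```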

5.1.3 Utility View

The utility view, shown in Figure 3(c), lists the utility metrics that can be employed to evaluate the change in utility after privacy preserving operations are applied to the dataset (TR3). The user can select one or multiple metrics from the following: joint degree, diameter, path length, clustering coefficient, and node degree.

While structural modifications to the graph (i.e., adding edges) are stored as vectors, a similarity measure is required to evaluate the change in utility (with respect to the prior state of the graph). We provide four common metrics for this: Euclidean distance, Manhattan distance, cosine similarity, and Jaccard similarity.
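For example, with the degree vector as the stored utility metric, the four measures can be computed as follows (a sketch using numpy/scipy; the vectors are made-up placeholders):

```python
# Sketch: the four similarity/distance options over utility vectors, e.g.
# a degree sequence before and after sanitization.
import numpy as np
from scipy.spatial import distance

before = np.array([5, 5, 4, 3, 2, 1], dtype=float)
after  = np.array([5, 5, 4, 4, 2, 2], dtype=float)

print(distance.euclidean(before, after))    # Euclidean distance
print(distance.cityblock(before, after))    # Manhattan distance
print(1 - distance.cosine(before, after))   # cosine similarity
# Jaccard similarity is computed over sets, e.g. the two graphs' edge sets:
E1, E2 = {(0, 1), (1, 2), (2, 3)}, {(0, 1), (1, 2), (1, 3)}
print(len(E1 & E2) / len(E1 | E2))
```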

5.2 Interface for Privacy Preservation

After a dataset is loaded and the user has specified priority and utility settings, the protector interface can be opened. Here, visual privacy preservation is enacted and the user can examine the series of changes to the graph’s provenance. At any point, the user may also navigate back to the graph interface to view the current, updated view of the dataset, as changes are reflected in the graph view. The protector interface contains two panels: the graph protector view and the provenance view.

5.2.1 Graph Protector View

Protectors are selected, configured, and executed using the graph protector view (Figure 1(c)). To start, the user creates a desired protector by clicking the corresponding label at the top of the panel. Adding multiple protectors of the same type is possible by clicking the same button multiple times. Though applying the protectors follows the same sequence of actions (Section 4.3), each type of protector is individually designed according to the structural feature(s) it sanitizes.

Degree Protector. The degree protector shows the degree distribution of nodes as a bar chart (as in Figure 1(a) and Figure 4). As degree distributions are usually skewed in graph datasets, some degree bins will contain no nodes, introducing gaps in the chart. To address this, we remove gaps and add a “degree gaps” tick mark to denote jumps, as shown in Figure 4.

To set up a privacy protection scheme, with respect to node degrees in this case, the user first needs to assign the desired k value. To do that, we define two types of settings. First, a “global” kG setting enforces k–anonymity for the entire node degree distribution (unless otherwise specified). Brushing along the bars sets “local” kL values for the nodes with selected ranges of degrees. The desired kL values for each of these localized regions can be adjusted independently. In Figure 4, the global kG is set to 2 and there are two local kL values set for nodes with relatively low and high degrees (set to 6 and 1, respectively).
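The underlying check can be sketched as follows (our simplification of the kG/kL semantics; local settings override the global one):

```python
# Sketch: flag degree bins that fall below their applicable k line.
from collections import Counter

def unsatisfied_bins(degrees, k_global: int, k_local: dict = None):
    k_local = k_local or {}
    hist = Counter(degrees)
    return {d: n for d, n in hist.items()
            if 0 < n < k_local.get(d, k_global)}

degrees = [1, 1, 1, 2, 3, 3, 9, 75]
# kG = 2, but the single degree-75 hub is deliberately exempted with kL = 1:
print(unsatisfied_bins(degrees, k_global=2, k_local={75: 1}))  # {2: 1, 9: 1}
```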

When interacting with the degree protector, the system highlights degree bins based on their status. Bins that do not meet k–anonymity (they are below their kG or kL line) are given a red border; locked nodes are colored blue.

After specifying a scheme, the user can click on the “preview” icon to see how the graph would be updated if the scheme were accepted and executed. Modifications that will happen to the graph are noted on the bar chart using T-marks indicating increases and decreases in the number of nodes in each bin. A more detailed example can be seen at the top of Figure 1(a).

Multiple schemes can be previewed and compared by clicking on the “camera” icon. At this point, one of the schemes can be accepted and executed.


Fig. 4. The module design for the degree protector.

Hub Fingerprint Protector. The hub fingerprint protector is shown below the degree protector in Figure 1(a). Opening this protector initially shows a table of nodes (left side of the protector). The user must first identify a set of nodes that are to be considered as hubs. Usually, nodes with high degree or centrality are strong candidates.

Rows can be sorted by the metrics selected in the priority view, which are the metrics that the user considers important for performing the analysis tasks. When nodes are toggled as hubs, they are pinned to the top of the list (see Figure 5(a)). After each hub selection, GraphProtector re-generates the fingerprints of the other nodes in the graph. Since a fingerprint for a node is determined by its relationship with hub nodes (i.e., is it connected or not?), we generate fingerprints based on whether a node is incident to each of the hub nodes.

Based on the selection of hub nodes, we categorize incident (non-hub) nodes based on their relationships to the hubs. For Figure 5(a), since there are three selected hubs, nodes are categorized into four groups: (1) no connection to any hub, (2) connection to one hub, (3) connection to two hubs, or (4) connection to all three hubs.

This categorization is used to construct a hub tree, which shows how likely a hub fingerprint attack could occur for the selected hubs. The hub tree corresponding to Figure 5(a) is shown in Figure 5(c). This tree has four levels (based on the four categories), where each level corresponds to the relationship category (the top level contains nodes with no relationship, etc.). At each row, squared bins are created to represent all possible combinations of selected hubs at that level. For example, at the third row (level 2) of Figure 5(c), each squared bin corresponds to a possible combination of two hub nodes out of the three selected hubs. Edges between bins in neighboring levels indicate that nodes in a fingerprint node of an upper level (connected to fewer hubs) can be moved to a fingerprint node of a lower level (connected to more hubs) by adding edges to them.
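A minimal sketch of this fingerprint construction (assuming networkx; the hub choice and names are hypothetical) groups non-hub nodes by their adjacency pattern over the hubs, with the pattern’s number of true entries giving the hub-tree level:

```python
# Sketch: compute hub fingerprints as boolean adjacency vectors over the
# chosen hubs; groups smaller than k would be flagged by the protector.
from collections import defaultdict
import networkx as nx

def hub_fingerprints(G: nx.Graph, hubs):
    groups = defaultdict(list)  # fingerprint tuple -> member nodes
    for v in G.nodes:
        if v in hubs:
            continue
        groups[tuple(G.has_edge(v, h) for h in hubs)].append(v)
    return groups

G = nx.karate_club_graph()
hubs = [33, 0, 32]  # e.g. the three highest-degree nodes
for fp, members in hub_fingerprints(G, hubs).items():
    print(sum(fp), fp, len(members))  # level, fingerprint, group size vs. k
```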

The encoding of a fingerprint node is shown in Figure 5(b), and can be interpreted as a metaphor for filling water into a bottle. Encodings are similar to the degree protector: the dashed k line is the desired water level, indicating the privacy level to be achieved. The accumulated height of the blue bar (the number of locked nodes) and the gray bar (the number of nodes that are not locked) shows the total number of nodes that share the corresponding fingerprint. The gray boxes on the right-hand side of the fingerprint node describe the type of fingerprint. In Figure 5(b), we demonstrate a fingerprint node that contains nodes linked to only the third hub node.


Fig. 5. (a) The three nodes with the highest degree are selected as hubs. (b) For each fingerprint node, the feature statistic is shown on the left and the feature (connection with the hubs) is explained on the right. (c) The hub tree constructed based on the three hubs shown in (a).

In some cases, many fingerprints are empty—that is, no nodes link to the specific sets of hubs. Since displaying those empty fingerprint nodes is redundant and can be misleading, we provide a mechanism that allows users to collapse them into a more compact layout. Figure 6 demonstrates an example.

Like the degree protector, the hub fingerprint protector includes functionality to preview and compare schemes. Upon deciding on a scheme, the user can execute it and update the graph.


Fig. 6. The hub tree before and after removing empty sets.

Subgraph Protector. Subgraph support is fundamentally different from degree and hub fingerprint support, since subgraphs are based on semantic knowledge about unique substructures within the dataset. Since the potential number of subgraphs in a network grows combinatorially, we require the user to know what subgraph structures to sanitize.

Upon opening this protector, the user initially specifies an exemplar subgraph. This queries the graph and shows instances of matched subgraphs. Matched subgraphs can be inspected individually to check whether they contain any privacy issues. Exemplars can be derived from classic types of subgraphs by modulating their parameters [10] or loaded from prepared files.

Instead of requiring exact matches, we employ an error-tolerant [17] approach for matching subgraphs. Tolerance is a parameter between 0 and 1. For example, a subgraph of 4 nodes and 5 edges is considered matched to a complete graph with 4 nodes (and 6 edges) given a tolerance value of 1/6, as shown in Figure 7. The VF2 algorithm [14] is employed to identify all subgraphs within the tolerance. Identified subgraphs within the set tolerance value are called “detected subgraphs”. If the number of detected subgraphs is less than the user-defined k, we then keep searching for similar subgraphs using a larger tolerance (these are called “similar subgraphs”). Similar subgraphs are the candidate subgraphs that can be used to protect the privacy of the detected subgraphs.


For similar subgraphs, we use dashed edges to indicate where edges will be added to make them identical to the detected subgraphs. The user can additionally review node information and highlight their positions, as shown in the subgraph protector in Figure 1(a).
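A brute-force sketch of tolerance matching in this spirit (ours, not the paper's implementation) runs networkx's VF2 matcher once per variant of the exemplar with up to a given number of edges removed; this is only feasible for small exemplars:

```python
# Sketch: find regions of G matching `exemplar` while missing at most
# tol_edges of its edges, by deleting edge subsets and running VF2.
from itertools import combinations
import networkx as nx
from networkx.algorithms import isomorphism

def tolerant_matches(G: nx.Graph, exemplar: nx.Graph, tol_edges: int = 0):
    matches = []
    edges = list(exemplar.edges)
    for t in range(tol_edges + 1):
        for removed in combinations(edges, t):
            pattern = exemplar.copy()
            pattern.remove_edges_from(removed)
            gm = isomorphism.GraphMatcher(G, pattern)
            matches.extend(gm.subgraph_isomorphisms_iter())
    return matches

G = nx.erdos_renyi_graph(60, 0.08, seed=7)
exemplar = nx.complete_graph(4)          # K4: 4 nodes, 6 edges
hits = tolerant_matches(G, exemplar, 1)  # tolerance 1/6, i.e. one missing edge
```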


Fig. 7. Inputting the exemplar (a) with a matching tolerance of 1/6, the subgraphs in (b) are qualified, while the subgraphs in (c) are not. The relative positions of the nodes in the subgraphs are preserved.

5.2.2 Provenance View

The provenance view provides a summary of all previously executed privacy preserving operations (TR4). It consists of two parts. The first part is a density-estimated plot which provides a quick overview of the graph. The second part shows the change in utility caused by each of the privacy preserving schemes. On top of the density-estimated plot, we render the edges added to the graph after applying the corresponding privacy preserving scheme. We originally considered two approaches for the density-estimated plot: kernel density estimation (KDE) [43] and curve density estimates (CDE) [31]. For our context, CDE is more suitable as it displays edge distribution through a linear kernel and thus better retains the edges’ finer visual appearance (see the comparison in Figure 8).

As shown in Figure 1(b), the step histories in the provenance view show the changes made by prior protector operations. The graph is rendered using CDE with added edges overlaid at each step. Changes to utility metrics are shown to the right of each history view. Increases and decreases in values are highlighted in green and red, respectively, while a metric that outputs a similarity score is colored in blue. The current state of the graph is shown at the end of the provenance view. The user can see the specific changes made by a protector in the graph view by clicking on its corresponding provenance view. If a user is dissatisfied with the current result, the most recent operation can be cancelled from the provenance view. If a user is satisfied with the state of the processed dataset, it can be exported into a documentary report containing the sanitized data and a post-processing report of the changes made.


Fig. 8. Visualizing a graph with the graph view (a), KDE (b) and CDE (c).

6 CASE STUDIES

We present two case studies that demonstrate using GraphProtector. The first compares protector schemes against each other, while the second deals with setting processing priorities and utility preservation metrics.

6.1 Email Communication Dataset

The email communication dataset in this first case study records 5,451 email contacts among 1,133 users at a university [2], and is shown as a node-link diagram in Figure 9(a).


Fig. 9. The email communication dataset. (a) In the graph view, the locked nodes are highlighted in blue. (b) In the priority view, the nodes that have “degree less than the 50th percentile” and “betweenness greater than the 80th percentile” are selected by brushing.

To start, node degree and betweenness are chosen for specifying priority. Figure 9(b) shows the distributions of these two metrics. By brushing the two charts, we quickly find that only 4 nodes have a low degree and a high betweenness (Figure 9(b)). We then add these nodes to the processing priority list and lock them to ensure they are not touched by protector operations. We also select the following metrics for utility analysis: path length, joint degree, clustering coefficient, and betweenness.

After going to the interface for privacy preservation, we first create a node degree protector with two schemes. For the first scheme (S1), to simulate a purely automatic model, we set the global k (the kG value) to 5, as seen in Figure 10(a). For the second scheme (S2), we set kG = 3, but then brush to assign a localized kL = 7 value for nodes with lower degree (below 30), as shown in Figure 10(b).

By comparing the utility changes required to execute each scheme (shown on the bar charts), we notice that S1 not only requires adding nearly twice as many edges, it also has a larger impact on the other utility measures. We therefore choose to perform S2.

Next, we open a subgraph protector and check if there are privacy issues caused by the following five types of classic subgraphs: complete, circle, path, star, and complete bipartite. We input several generalized exemplar subgraphs for each type and set the tolerance to 0. As a result, we find a subgraph, Figure 11(b), that matches a pre-defined bipartite graph exemplar of size (6,3), as seen in Figure 11(a). To achieve 2-anonymity for this case, we continue to search for matching subgraphs with a relaxed tolerance value. When the tolerance is set to 1/9, one additional matching subgraph, Figure 11(c), can be found. Adding two edges to the matched subgraph, shown as the dashed lines in Figure 11(c), fixes the privacy leak.

6.2 Face-to-Face Contacts Dataset

For the second case study, we use a dataset describing face-to-face contacts during an exhibition [1]. Edges represent conversations between visitors that lasted longer than 20 seconds. This social network has 410 nodes and 2,765 edges. To illustrate the importance of processing priority, in this case we compare the changes in utility metrics when using different priority schemes.


Fig. 10. Applying two schemes to the email communication dataset. The changes in utility metrics show that S1 yields nearly double the loss of S2.


Fig. 11. An input subgraph and two matching subgraphs in the email communication dataset. Dashed lines represent the edges to be added to achieve 2-anonymity.



Fig. 12. The line chart of node degree over percentile ranking.

Figure 12 indicates that the degree distribution of the dataset is uniform except for the nodes with the highest degrees. To further study this distribution, degree is selected as the utility evaluation metric. We then create two priority schemes: (1) nodes with “degree less than the 2nd percentile” are locked (S3), and (2) nodes with “degree greater than the 98th percentile” are locked (S4). For each priority scheme, the same privacy operations are applied:

Step 1: Using a degree protector, kG is set to 2.

Step 2: Using a hub fingerprint protector, we choose the four nodes with the highest closeness as hub nodes and set kG = 5.

After Step 1, the results produced by the two schemes differ greatly, as shown in Figure 13(a). To perform privacy protection, S3 needs to add 3 edges while S4 adds 33. This is because some of the high-degree nodes have a unique degree. S3 allows GraphProtector to add edges to those high-degree nodes, so nodes with very high degrees can be made to have the same degree as each other. In contrast, S4 locks those high-degree nodes, meaning they cannot be touched. As a result, other nodes must be used to resolve privacy issues, which requires adding more edges overall. As shown in the second CDE of Figure 13(a), a large number of edges are added to two nodes in order to increase their degrees. This significantly affects the degree distribution of the low-degree nodes, as shown in the second bar chart of Figure 13(a).

The two schemes add similar numbers of edges in Step 2. However, the CDE results (Figure 13(b)) show that S4 adds 9 edges to the same hub node. This modification would significantly undermine the characteristics of the hub with high closeness.


Fig. 13. Two steps with two schemes applied to the face-to-face contact dataset. In each row, changes in the number of structural features are visualized on the left and the CDE with the new edges is displayed on the right. Edges indicating the feature transformations in Step 2 are highlighted in blue.

This case verifies the effectiveness of processing priority. To get satisfactory results, it is necessary to start with an appropriate priority. If users want to preserve features formed by a group of nodes, the decisions need to be made carefully. The locking function not only limits the modification of a node but also limits the means available to protect it. As mentioned in Section 4.2, one of the two privacy preserving approaches must modify the node for anonymization. If a locked node has a privacy issue, automatic algorithms can only tackle it with one of the two approaches, preventing consideration of utility. In summary, locking nodes can only protect the features of nodes, not the features of the entire graph.

7 DISCUSSION

To solicit further feedback about our system, we interviewed domain experts in data privacy about how tools like GraphProtector can provide a detailed, guided approach for enabling privacy preservation. In addition, we discuss additional considerations for tools like this: steering sanitization pipelines, dealing with more complex network datasets, and current system limitations.

7.1 Expert Reviews

During the design process, we kept in touch with the domain expert who helped formalize the set of task requirements (Section 3). During development, he continually provided suggestions for improving the system's functionality and design; for example, selecting nodes in the priority view by ranking percentage is more common in his field of research than selecting by specific attribute values.

After the system was fully implemented, we interviewed an additional four experts who research privacy preservation and/or computer security. Each was familiar with network privacy, especially the use of algorithms to preserve privacy and sanitize datasets. Normally this is done in a data-only context: visualization is not leveraged, and algorithms are applied in a one-size-fits-all manner with no customized sanitization for dataset subsets. The resultant utility of a dataset is therefore a major concern for them.

In discussing their normal preservation workflows, two main problems were commonly encountered: (1) reviewing data in tabular form is time-consuming and tedious, and (2) it is intractable to formulate and compare detailed solutions to specific issues.

For each expert, we first introduced the GraphProtector system and conducted a live, hands-on demo (taking approximately 30 minutes). We then discussed the feasibility of our approach and asked how well tools like GraphProtector would fit their dataset sanitization needs.

Being able to measure utility on the fly was immediately seen as a major benefit by several of the experts. Integrating multiple techniques with custom schemes was also seen as beneficial. Though not all experts used our specific protector modules in their normal workflows, they had no difficulty learning their design and understanding their functionality.

A slight surprise to us was that some experts had trouble with the provenance view. As one noted, his workflow did not involve looking at historical snapshots or any exploratory, user-in-the-loop components; instead, his research relies heavily on developing advanced algorithms that automatically identify optimal parameters for privacy preservation. In contrast, another expert praised our system for providing a "fine-grained data processing" pipeline. In particular, this expert appreciated being able to locally set k-values as well as processing priority, which he saw as an improvement for accurate privacy preservation. For his research group, such a design exactly met their requirement of seeking detailed sanitization solutions for network datasets.

An additional expert commented that visualizing the data helps interpretation, especially when presenting results for a processed dataset. He expressed interest in using our design in the future when presenting and publishing his results at conferences and in journals, or when explaining them to stakeholders and non-experts.

7.2 Detailed Guidance

By integrating automatic models, a visual analysis system allows users to make adjustments in a steering fashion [29]. By using visual analytics, we "uncover the black box" of the privacy models and meet the need for specific, customized privacy preservation settings.

In GraphProtector, users guide the automated algorithms in two respects: terminals (privacy-preserving goals) and directions (processing priorities). For individual or localized issues, kL values can be set, as sketched below. This is significant because differences in data distributions and practical requirements make global settings hard to transfer between data contexts, a task that fully automatic algorithms cannot easily fulfill across scenarios.
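The sketch below is an assumption about how per-node k-values could be represented, not our system's internal data structure: each sensitive node carries its own anonymity threshold, with the global kG as the fallback.

```python
# Illustrative sketch of local k-values for degree anonymity.
from collections import Counter
import networkx as nx

def degree_issues(G, k_global, k_local):
    """k_local maps a node to its own threshold; nodes absent from it
    fall back to k_global. Returns nodes whose degree is shared by
    fewer peers than their threshold requires."""
    counts = Counter(d for _, d in G.degree())
    return [n for n, d in G.degree()
            if counts[d] < k_local.get(n, k_global)]

G = nx.les_miserables_graph()
vip = {"Valjean": 10, "Javert": 10}   # stricter local thresholds
print(degree_issues(G, k_global=2, k_local=vip))
```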

In addition, setting reasonable priority orderings gives the automatic algorithms direction and simplifies the process of balancing privacy and utility. Finding an optimal solution to privacy preservation has been shown to be NP-hard [4]. User interactions greatly limit the exploration space, focusing graph sanitization according to GraphProtector's workflow.

7.3 Defining Graph Utility

In Section 3, we defined utility as the similarity of structural properties within a graph [27]. It is important to note that, just as privacy algorithms depend on task and context, so does the definition of utility. For example, instead of measuring how much the graph's structure has changed, utility could be measured as how accurately (or precisely) a set of rule-based questions can still be answered. After sanitization, if the number of correct responses remains high, then utility has been successfully preserved.
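A hedged sketch of this alternative definition follows: utility is the fraction of rule-based queries whose answers survive sanitization. The queries here are invented examples, not a fixed benchmark.

```python
# Query-based utility: compare answers before and after sanitization.
import networkx as nx

def query_utility(original, sanitized, queries):
    """Fraction of queries that return the same answer on both graphs."""
    same = sum(q(original) == q(sanitized) for q in queries)
    return same / len(queries)

queries = [
    lambda G: nx.density(G) > 0.05,                  # is the graph dense?
    lambda G: max(d for _, d in G.degree()) >= 30,   # is there a big hub?
    lambda G: nx.number_connected_components(G) == 1,
]

G0 = nx.karate_club_graph()
G1 = G0.copy()
G1.add_edge(0, 16)                                   # a sanitization edit
print("utility:", query_utility(G0, G1, queries))
```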

For our purposes, our collaborating expert users were solely interested in measuring utility via structural graph properties. Adding new metrics for evaluating utility in new scenarios is certainly possible, but is likewise beyond this paper's scope.

7.4 Scalability and Performance

Scalability is almost always a point of concern when designing visual interfaces. Given limited screen space, how to best visualize and facilitate the understanding of large and dense graph datasets is a challenging and open research topic. A common practice for addressing this issue is to compute global and local statistical features that describe the input graph from multiple aspects (such as the functionality provided in our GraphProtector system). Another possible approach is to first apply clustering or aggregation algorithms to reveal the high-level structure of the graph and then allow the user to acquire more detailed information about different parts of the graph through exploration, as sketched below. Further detailed discussion along this direction, however, is beyond the scope of this paper.
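As a sketch of that second approach, the code below uses networkx's greedy modularity communities as a stand-in for any clustering algorithm and contracts each community into a single node to obtain a small overview graph; the function name community_overview is ours.

```python
# Build a community-level overview of a large graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_overview(G):
    """Cluster G, then return a supergraph with one node per community
    and an edge wherever two communities are connected in G."""
    comms = list(greedy_modularity_communities(G))
    member = {n: i for i, c in enumerate(comms) for n in c}
    S = nx.Graph()
    S.add_nodes_from(range(len(comms)))
    for u, v in G.edges():
        if member[u] != member[v]:
            S.add_edge(member[u], member[v])
    return S, comms

G = nx.karate_club_graph()
overview, comms = community_overview(G)
print(len(comms), "communities,", overview.number_of_edges(), "links")
```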

Computational efficiency is important for an interactive system, as users typically expect short response times. Unfortunately, querying structures and statistical information in large graphs is time- and memory-intensive. For instance, querying 20% of all nodes has a time complexity of Θ(N^3.5) [9], even when specific information such as labels is known.

Pre-computation is a possible solution to improve the performance of tools like GraphProtector. However, during sanitization the graph changes dynamically, making pre-computation infeasible. In addition, some complicated structural features may be manually identified based on the user's current semantics. For instance, users may need to specify subgraphs with arbitrary structures and fulfill subgraph-matching tasks with respect to the assumed background knowledge of attackers.

To improve efficiency, we can adjust the optimization goals by setting the kG and kL values and performing lazy searches: when querying for a structure, the search stops once enough matching subgraphs have been found, and users can be informed that the issue is not a risk. This approach is feasible because small structures tend to appear frequently in large-scale networks. A minimal sketch of such a lazy search follows.
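The sketch uses networkx's GraphMatcher; the early-exit threshold k stands in for the quota of matches that clears an issue, and the function name lazy_count is ours.

```python
# Lazy subgraph search: stop enumerating matches at k instead of
# exhausting the (potentially exponential) match space.
from itertools import islice

import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def lazy_count(G, pattern, k):
    """Return min(k, number of subgraph matches); never looks past k."""
    matcher = GraphMatcher(G, pattern)
    return sum(1 for _ in islice(matcher.subgraph_isomorphisms_iter(), k))

G = nx.erdos_renyi_graph(200, 0.05, seed=1)
triangle = nx.complete_graph(3)
k = 5
if lazy_count(G, triangle, k) >= k:
    print("at least", k, "matches: this structure is not a privacy risk")
```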

8 CONCLUSION

Privacy preservation is a challenging topic due to the inherent conflict between privacy and utility. The diversity of structural features gives adversaries many angles from which to attack a graph, which increases the difficulty of defense. We design and implement a novel visual interface, called GraphProtector, to resist potential attacks on structural features. After dividing the overall problem into individual issues, users can work on one issue at a time and customize precise strategies. Feedback about utility is presented to assist users in comparing strategies.

We demonstrate the effectiveness of our approach through two case studies with real-world datasets and expert interviews. As security technology develops, privacy protection may face increasingly multi-faceted threats. Integrating more comprehensive techniques can provide stronger defenses but can also lead to conflicts. In the future, we plan to find a more effective way to control the conflicts generated by hybrid approaches.

ACKNOWLEDGEMENTS

This research has been sponsored in part by the National 973 Program of China (2015CB352503), the National Natural Science Foundation of China (61772456, 61761136020), and the Alibaba-Zhejiang University Joint Institute of Frontier Technologies. This research is also supported in part by the U.S. National Science Foundation through grants IIS-1320229 and IIS-1741536.


REFERENCES

[1] Infectious. http://konect.uni-koblenz.de/networks/sociopatterns-infectious.
[2] U. Rovira i Virgili. http://konect.uni-koblenz.de/networks/arenas-email.
[3] D. Archambault and N. Hurley. Visualization of trends in subscriber attributes of communities on mobile telecommunications networks. Social Network Analysis and Mining, 4(1):205, Jun 2014.
[4] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios. Disclosure limitation of sensitive rules. In Workshop on Knowledge and Data Engineering Exchange, pp. 45–52. IEEE, 1999.
[5] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, pp. 181–190. ACM, 2007.
[6] P. Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170–1182, 1987.
[7] M. Boss, H. Elsinger, M. Summer, and S. Thurner. Network topology of the interbank market. Quantitative Finance, 4(6):677–684, 2004.
[8] R. S. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349–399, 2004.
[9] V. Carletti, P. Foggia, A. Saggese, and M. Vento. Challenging the time complexity of exact subgraph isomorphism for huge and dense graphs with VF3. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[10] G. Chartrand, P. Erdos, and O. R. Oellermann. How to define an irregular graph. College Math. J., 19(1):36–42, 1988.
[11] J. Cheng, A. W.-c. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 459–470.
[12] J.-K. Chou, C. Bryan, and K.-L. Ma. Privacy preserving visualization for social network data with ontology information. In Pacific Visualization Symposium, pp. 11–20. IEEE, 2017.
[13] J.-K. Chou, Y. Wang, and K.-L. Ma. Privacy preserving event sequence data visualization using a Sankey diagram-like representation. In SIGGRAPH ASIA 2016 Symposium on Visualization, p. 1.
[14] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, 2004.
[15] A. Dasgupta and R. Kosara. Adaptive privacy-preserving visualization using parallel coordinates. IEEE Transactions on Visualization and Computer Graphics, 17(12):2241–2248, 2011.
[16] W. Didimo, G. Liotta, F. Montecchiani, and P. Palladino. An advanced network visualization system for financial crime detection. In IEEE Pacific Visualization Symposium, pp. 203–210, 2011.
[17] S. P. Dwivedi and R. S. Singh. Error-tolerant graph matching using homeomorphism. In International Conference on Advances in Computing, Communications and Informatics, pp. 1762–1766, 2017.
[18] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.
[19] L. C. Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, 1978.
[20] A. Garas, P. Argyrakis, and S. Havlin. The structural role of weak and strong links in a financial market network. The European Physical Journal B, 63(2):265–271, 2008.
[21] A. Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.
[22] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
[23] M. Gjoka, B. Tillman, and A. Markopoulou. Construction of simple graphs with a target joint degree matrix and beyond. In IEEE Conference on Computer Communications, pp. 1553–1561, 2015.
[24] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment, 1(1):102–114, 2008.
[25] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li. RolX: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1231–1239, 2012.
[26] N.-C. Hsieh. An integrated data mining and behavioral scoring model for analyzing bank customers. Expert Systems with Applications, 27(4):623–633, 2004.
[27] S. Ji, W. Li, P. Mittal, X. Hu, and R. A. Beyah. SecGraph: A uniform and open-source evaluation system for graph data anonymization and de-anonymization. In USENIX Security Symposium, pp. 303–318, 2015.
[28] S. Ji, W. Li, M. Srivatsa, J. S. He, and R. Beyah. Structure based data de-anonymization of social networks and mobility traces. In International Conference on Information Security, pp. 237–254. Springer, 2014.
[29] D. Keim, G. Andrienko, J.-D. Fekete, C. Gorg, J. Kohlhammer, and G. Melancon. Visual analytics: Definition, process, and challenges. In A. Kerren, J. T. Stasko, J.-D. Fekete, and C. North, eds., Information Visualization, pp. 154–175. Springer, 2008.
[30] N. Korula and S. Lattanzi. An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment, 7(5):377–388, 2014.
[31] O. D. Lampe and H. Hauser. Curve density estimates. Computer Graphics Forum, 30(3):633–642, 2011.
[32] J. Leskovec and J. J. Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pp. 539–547, 2012.
[33] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 93–106.
[34] S. Liu, W. Cui, Y. Wu, and M. Liu. A survey on information visualization: recent advances and challenges. The Visual Computer, 30(12):1373–1393, 2014.
[35] Q. Luo and D. Zhong. Using social network analysis to explain communication characteristics of travel-related electronic word-of-mouth on social networking sites. Tourism Management, 46:274–282, 2015.
[36] K. K. Moller and A. Halinen. Business relationships and networks: Managerial challenge of network era. Industrial Marketing Management, 28(5):413–427, 1999.
[37] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 173–187, 2009.
[38] S. Nilizadeh, A. Kapadia, and Y.-Y. Ahn. Community-enhanced de-anonymization of online social networks. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 537–548.
[39] J. Oksanen, C. Bergman, J. Sainio, and J. Westerholm. Methods for deriving and calibrating privacy-preserving heat maps from mobile sports tracking application data. Journal of Transport Geography, 48:135–144, 2015.
[40] S. A. Ríos, F. Aguilera, J. D. Nunez-Gonzalez, and M. Grana. Semantically enhanced network analysis for influencer identification in online social networks. Neurocomputing, 2017.
[41] A. Sala, X. Zhao, C. Wilson, H. Zheng, and B. Y. Zhao. Sharing graphs using differentially private graph models. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, pp. 81–98.
[42] T. Schank and D. Wagner. Approximating clustering coefficient and transitivity. Journal of Graph Algorithms and Applications, 9(2):265–275, 2005.
[43] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
[44] G.-D. Sun, Y.-C. Wu, R.-H. Liang, and S.-X. Liu. A survey of visual analytics techniques and applications: State-of-the-art research and future challenges. Journal of Computer Science and Technology, 28(5):852–867, 2013.
[45] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
[46] B. Thompson and D. Yao. The union-split algorithm and cluster-based anonymization of social networks. In Proceedings of the 4th International Symposium on Information, Computer, and Communications Security, pp. 218–227. ACM, 2009.
[47] T. von Landesberger, A. Kuijper, T. Schreck, J. Kohlhammer, J. J. van Wijk, J. Fekete, and D. W. Fellner. Visual analysis of large graphs: State-of-the-art and future research challenges. Computer Graphics Forum, 30(6):1719–1749, 2011.
[48] X. Wang, J.-K. Chou, W. Chen, H. Guan, W. Chen, T. Lao, and K.-L. Ma. A utility-aware visual approach for anonymizing multi-attribute tabular data. IEEE Transactions on Visualization and Computer Graphics, 24(1):351–360, 2018.
[49] Y. Wang and X. Wu. Preserving differential privacy in degree-correlation based graph generation. Transactions on Data Privacy, 6(2):127–145, 2013.
[50] X. Wu, X. Ying, K. Liu, and L. Chen. A survey of privacy-preservation of graphs and social networks. In C. C. Aggarwal and H. Wang, eds., Managing and Mining Graph Data, pp. 421–453. Springer US, Boston, MA, 2010.
[51] Y. Wu, S. Liu, K. Yan, M. Liu, and F. Wu. OpinionFlow: Visual analysis of opinion diffusion on social media. IEEE Transactions on Visualization and Computer Graphics, 20(12):1763–1772, 2014.
[52] M. Xiao, J. Wu, and L. Huang. Community-aware opportunistic routing in mobile social networks. IEEE Transactions on Computers, 63(7):1682–1695, 2014.
[53] Q. Xiao, R. Chen, and K.-L. Tan. Differentially private network data release via structural inference. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 911–920. ACM, 2014.
[54] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 587–596, 2013.
[55] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 739–750.
[56] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In Proceedings of the IEEE 24th International Conference on Data Engineering, pp. 506–515, 2008.
[57] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. Proceedings of the VLDB Endowment, 2(1):946–957, 2009.