PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice:...

16
PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1,2 , Matei Ripeanu 2 , Nicolas Tripoul 2 , Geoffrey Sanders 1 , Roger Pearce 1 1 Center for Applied Scientific Computing, Lawrence Livermore National Laboratory 2 Department of Electrical and Computer Engineering, University of British Columbia {treza, matei, mtripoul}@ece.ubc.ca, {sanders29, rpearce}@llnl.gov Abstract—Pattern matching is a powerful graph analysis tool. Unfortunately, existing solutions have limited scalability, support only a limited set of search patterns, and/or focus on only a subset of the real-world problems associated with pattern matching. This paper presents a new algorithmic pipeline that: (i) enables highly scalable pattern matching on labeled graphs, (ii) supports arbitrary patterns, (iii) enables trade-offs between precision and time-to-solution (while always selecting all vertices and edges that participate in matches, thus offering 100% recall), and (iv) supports a set of popular data analytics scenarios. We implement our approach on top of HavoqGT and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) graphs, respectively, and at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than used in the past for similar problems. I. I NTRODUCTION Pattern matching in graphs, that is, finding subgraphs that match a small template graph within a large background graph, is fundamental to graph analysis and has applications in multiple areas such as social network analysis [1], bioin- formatics [2], and information mining [3]. A match can be broadly categorized as either exact - i.e., there is a bijective map between the vertices/edges in the template and those in the matching subgraph, or approximate - the template and the match are just similar by some defined similarity metric [4]. If the template size is not limited, exact matching is not known to have a polynomial time solution in the general case [5]. Berry et al. [6] introduced the problem of type- isomorphism: metadata graphs where vertices and edges are labeled and, in addition to topological constraints, a match identifies nodes and edges with the same labels in the template and the background graph. While the labeled version does not reduce the worst-case complexity of the original problem, past experience [1], [7], [8] has demonstrated that label-based matching can be a powerful tool with potential for practical, real-world applications such as social network analysis. The Challenge. Applications that mine graphs consist of tens of billions of edges are common [9], [10]. However, existing pattern matching solutions (we survey related work in §VII) have limited capabilities: most importantly, they do not scale to massive graphs and/or support only a restricted set of search templates. Additionally, the algorithms at the core of the existing techniques are not suitable for today’s infrastructures relying on horizontal scalability and share- Template Graph Output Fig. 1: An example of a background graph G (center), template G 0 (left) and the output G * after vertex and edge elimination (right). The output is a refined set of vertices and edges that contain every subgraph H that matches G 0 . The figures present vertex metadata as colored shapes. The eliminated vertices and edges are colored solid grey. nothing clusters as most of these algorithms are inherently sequential and difficult to parallelize [5], [11], [12]. Finally, pattern matching is susceptible to combinatorial explosion of the intermediate or final state: for low selectivity queries, the number of subgraphs partially (or entirely) matching the template can grow exponentially with the number of nodes and edges in the already large background graph [13], [14], posing serious memory and communication challenges. A New Approach for Scalable Pattern Matching. We pro- pose a new algorithmic pipeline based on graph pruning. The idea of exploring graph pruning to support pattern matching stems from three key observations: First, the traditionally used tree-search techniques [5] generally attempt to enumerate all matches through explicit search. When a search path fails, such an unprofitable path is marked invalid and ignored in the subsequent steps. Similar to past works that use graph pruning [15], [16], [17] or, more generally, input reduction [18], we observe that it is much cheaper to first focus on eliminating the vertices and edges that do not meet the label and topological constraints introduced by the search template. Our experience shows that relatively simple pruning heuristics based on label and vertex neighborhood constraints can significantly prune away much of the background graph. The key contributions of this paper is a pruning-based solution that limits the exponen- tial growth of the state-space, scales to massive graphs and distributed memory machines with large number of processors, and supports arbitrary search templates: the result of pruning is the complete set of all vertices and edges that participate in a match, with no false positives or false negatives. Second, we observe that a vertex-centric formulation for such pruning algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric frameworks (e.g., Giraph [19], GraphLab [20], HavoqGT [21]). In our vertex-centric formulation for pruning, a vertex must satisfy SC18, November 11-16, 2018, Dallas, Texas, USA 978-1-5386-8384-2/18/$31.00 c 2018 IEEE

Transcript of PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice:...

Page 1: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

PruneJuice Pruning Trillion-edge Graphsto a Precise Pattern-Matching Solution

Tahsin Reza12 Matei Ripeanu2 Nicolas Tripoul2 Geoffrey Sanders1 Roger Pearce11Center for Applied Scientific Computing Lawrence Livermore National Laboratory2Department of Electrical and Computer Engineering University of British Columbiatreza matei mtripouleceubcca sanders29 rpearcellnlgov

AbstractmdashPattern matching is a powerful graph analysis toolUnfortunately existing solutions have limited scalability supportonly a limited set of search patterns andor focus on only a subsetof the real-world problems associated with pattern matchingThis paper presents a new algorithmic pipeline that (i) enableshighly scalable pattern matching on labeled graphs (ii) supportsarbitrary patterns (iii) enables trade-offs between precision andtime-to-solution (while always selecting all vertices and edgesthat participate in matches thus offering 100 recall) and (iv)supports a set of popular data analytics scenarios We implementour approach on top of HavoqGT and demonstrate its advantagesthrough strong and weak scaling experiments on massive-scalereal-world (up to 257 billion edges) and synthetic (up to 44trillion edges) graphs respectively and at scales (1024 nodes 36864 cores) orders of magnitude larger than used in the pastfor similar problems

I INTRODUCTION

Pattern matching in graphs that is finding subgraphs thatmatch a small template graph within a large backgroundgraph is fundamental to graph analysis and has applicationsin multiple areas such as social network analysis [1] bioin-formatics [2] and information mining [3] A match can bebroadly categorized as either exact - ie there is a bijectivemap between the verticesedges in the template and those inthe matching subgraph or approximate - the template andthe match are just similar by some defined similarity metric[4] If the template size is not limited exact matching is notknown to have a polynomial time solution in the generalcase [5] Berry et al [6] introduced the problem of type-isomorphism metadata graphs where vertices and edges arelabeled and in addition to topological constraints a matchidentifies nodes and edges with the same labels in the templateand the background graph While the labeled version doesnot reduce the worst-case complexity of the original problempast experience [1] [7] [8] has demonstrated that label-basedmatching can be a powerful tool with potential for practicalreal-world applications such as social network analysis

The Challenge Applications that mine graphs consist oftens of billions of edges are common [9] [10] Howeverexisting pattern matching solutions (we survey related workin sectVII) have limited capabilities most importantly they donot scale to massive graphs andor support only a restrictedset of search templates Additionally the algorithms at thecore of the existing techniques are not suitable for todayrsquosinfrastructures relying on horizontal scalability and share-

Template

Graph Output

Fig 1 An example of a background graph G (center) template G0 (left) andthe output Glowast after vertex and edge elimination (right) The output is a refinedset of vertices and edges that contain every subgraph H that matches G0 Thefigures present vertex metadata as colored shapes The eliminated vertices andedges are colored solid grey

nothing clusters as most of these algorithms are inherentlysequential and difficult to parallelize [5] [11] [12] Finallypattern matching is susceptible to combinatorial explosion ofthe intermediate or final state for low selectivity queriesthe number of subgraphs partially (or entirely) matching thetemplate can grow exponentially with the number of nodesand edges in the already large background graph [13] [14]posing serious memory and communication challenges

A New Approach for Scalable Pattern Matching We pro-pose a new algorithmic pipeline based on graph pruning Theidea of exploring graph pruning to support pattern matchingstems from three key observations First the traditionally usedtree-search techniques [5] generally attempt to enumerate allmatches through explicit search When a search path failssuch an unprofitable path is marked invalid and ignored in thesubsequent steps Similar to past works that use graph pruning[15] [16] [17] or more generally input reduction [18] weobserve that it is much cheaper to first focus on eliminating thevertices and edges that do not meet the label and topologicalconstraints introduced by the search template Our experienceshows that relatively simple pruning heuristics based on labeland vertex neighborhood constraints can significantly pruneaway much of the background graph The key contributions ofthis paper is a pruning-based solution that limits the exponen-tial growth of the state-space scales to massive graphs anddistributed memory machines with large number of processorsand supports arbitrary search templates the result of pruningis the complete set of all vertices and edges that participate ina match with no false positives or false negatives

Second we observe that a vertex-centric formulation forsuch pruning algorithms exists and this makes it possible toharness existing high-performance vertex-centric frameworks(eg Giraph [19] GraphLab [20] HavoqGT [21]) In ourvertex-centric formulation for pruning a vertex must satisfy

SC18 November 11-16 2018 Dallas Texas USA978-1-5386-8384-218$3100 ccopy2018 IEEE

Template (b)Template (a) Template (c)

Fig 2 Three examples of search templates and background graphs thatjustify the full set of pruning constraints Template (a) is a 3-cycle cycles oflength 3k with repeated labels in the background graph meet neighborhoodconstraints surviving LCC Template (b) contains several vertices with non-unique labels to its right there is a background graph that meets individualpoint-to-point path constraints also surviving NLCC-path checking Template(c) is characterized by two 4-cliques that overlap at a 3-cycle the backgroundgraph structure to the right is doubly periodic (a 4times3 torus) and meets all edgeand vertex cycle constraints surviving NLCC-cycle checking Templates (b)and (c) require NLCC-template-driven search to guarantee no false positivestemplate (a) only needs cycle checking in addition to LLC

two types of constraints local and non-local to possibly bepart of a match Local constraints involve only the vertex andits neighborhood a vertex in an exact match needs to (i) matchthe label of a corresponding vertex in the template and (ii)have edges to vertices labeled as prescribed in the adjacencystructure of this corresponding vertex in the template Non-local constraints are topological requirements beyond theimmediate neighborhood of a vertex (eg that the vertex mustbe part of a cycle) We describe how these constraints aregenerated and our algorithmic solution to verify them in sectIII

Third we observe that full match enumeration is notthe most efficient avenue to support many high-level graphanalysis scenarios Depending on the final goal of the userpattern matching problems fall into a number of categorieswhich include (a) determining if a match exists (or not) inthe background graph (yesno answer) (b) selecting all thevertices and edges that participate in matches (c) rankingthese vertices or edges based on their centrality with respectto the search template ie the frequency of their participationmatchs (d) countingestimating the total number of matches(comparable to the well-known triangle counting problem) or(e) enumerating all distinct matches in the background graphThe traditional approach is to perform (e) and to use the resultof the enumeration to answer (a) ndash (d) However this approachis limited to small background graphs or is dependent on alow number of near and exact matches within the backgroundgraph (due to exponential growth of the search state-space)We argue that a pruning-based pipeline is not only a practicalsolution to (a) ndash (d) (and to other pattern-matching-related an-alytics) when full match enumeration is not the main interestbut also an efficient path towards full match enumeration Wedemonstrate that a solution starting from the techniques wedevelop for pruning efficiently supports match enumerationfor two reasons First the pruned graph can be multiple ordersof magnitude smaller than the background graph and existinghigh-complexity enumeration routines (which otherwise wouldbe intractable following the conventional approach) are nowapplicable Second our pruning techniques collect additionalkey information to accelerate match enumeration for eachvertex in the pruned graph our algorithms build a list of itspotential matches in the template (sectV-C))

Contributions We capitalize on our preliminary study [15]that highlighted the effectiveness of pruning (yet in the contextof a restricted set of templates) and design a pattern matching

solution that is generic - no restrictions on the set of patternssupported precise - no false positives offers 100 recall -retrieving all matches efficient - low generated network trafficand scalable - able to process graphs with up to trillions ofedges on tens of thousands of cores (as demonstrated) Inparticular we make the following contributions

(i) Novel Asynchronous Algorithms We have developedasynchronous vertex-centric algorithms able to prune the back-ground graph to a precise enumeration of all vertices andedges that participate in a match for arbitrary templates Thekey gap we bridge is the ability support templates with re-peated vertex labels cycles and arbitrary edge density Impor-tantly the algorithms map well on a distributed asynchronousgraph processing platform thus enabling scalability and high-performance For a given template we propose heuristics togenerate the constraints that are later used for pruning Wehave also developed correctness proofs to show that theseconstraints eliminate all non-matching vertices and offer fullrecall (not included here due to space constraints)

(ii) Optimized Distributed Implementation We offer anefficient implementation of these algorithms on the top ofHavoqGT [21] an open-source asynchronous graph processingframework The prototype includes two key optimizationsthat dramatically reduce the generated traffic aggressive edgeelimination and what we call work aggregation a techniquethat skips duplicate checks in non-local constraint checkingthus preventing possible combinatorial explosion Addition-ally our pruning implementation collects potential matchinginformation not only it prunes away all vertices and edges thatdo not participate in any match but for each of the verticesthat remain it collects their mappings to the search templateWe use this information to accelerate match enumeration

(iii) Proof of Feasibility We demonstrate the applicabilityof this solution by experimenting on real-world and syntheticdatasets orders of magnitude larger than prior work (sectV) Weevaluate scalability through two experiments first a strongscaling experiment using real-world datasets including thelargest openly available webgraph whose undirected versionhas over 257 billion edges second a weak scaling experimentusing synthetic R-MAT [22] graphs of up to 44 trillionedges on up to 1024 compute nodes (36864 cores) Wedemonstrate support for search patterns representative of prac-tical queries in both relatively low selectivity and needle inthe haystack scenarios and to stress our system considerpatterns containing only the highest-frequency vertex labels(up to 14B instances) We show that our technique prunes thegraph by orders of magnitude which combined with usingthe intermediary state generated by pruning makes matchenumeration feasible on graphs with trillions of edges

(iv) Application Demonstration We demonstrate the abilityof our solution to support practical graph analytics queriesTo this end we use two real-world metadata graphs which wehave curated from publicly available datasets Reddit (39Bvertices 14B edges) and the smaller International MovieDatabase (IMDB) and demonstrate practical use cases of ourtechnique to support rich pattern mining (sectV-D)

(v) Exploring Trade-offs and the Impact of Strategic DesignChoices and Optimizations Our approach has the added flexi-bility that search can be stopped early leading to the ability totrade faster time to an approximate solution (or even precisesolution yet without 100 precision guarantees) for precision(rate of false positives in the pruned graph) and precisionguarantees We also explore the impact of each optimizationused (sectV-F) the cumulative impact of these optimizations isa multiple orders of magnitude of runtime reduction bringingpattern matching on massive metadata graphs in the realm ofpossible graph analytics

II PRELIMINARIES

We aim to identify all structures within a large backgroundgraph G identical to a small connected template graph G0We describe general graph properties for G and use the samenotation (summarized in Table I) for other graph objects

A graph G(V E) is a collection of n vertices V =0 1 nminus 1 and m edges (i j) isin E where i j isin V (i isthe edgersquos source and j is the target) Here we only discusssimple (ie no self-edges) undirected vertex-labeled graphsalthough the techniques are applicable to directed non-simplegraphs with labels on both edges and vertices An undirectedG satisfies (i j) isin E if and only if (j i) isin E Vertex irsquosadjacency list adj(i) is the set of all j such that (i j) isin E A vertex-labeled graph also has a set of n` labels L of whicheach vertex i isin V has an assignment `(i) isin L

A walk in G is an ordered subsequence of V where eachconsecutive pair is an edge in E A walk with no repeatedvertices is a path A path with equal first and last vertex is acycle An acyclic graph has no cycles

We discuss several graph objects simultaneously the tem-plate graph G0(V0 E0) the background graph G(V E) andthe current solution subgraph Glowast(Vlowast Elowast) with Vlowast sub V andElowast sub E Our techniques iteratively refine Vlowast and Elowast until theyconverge to the union of all subgraphs of G that exactly matchthe template G0

For clarity when referring to vertices and edges from thetemplate graph G0 we will use the notation qi isin V0 and(qi qj) isin E0 Conversely we will use vi isin V and (vi vj) isin Efor vertices and edges from the background graph G or thesolution subgraph GlowastDefinition 1 A subgraph H(VH EH)VH sub V EH sub E is anexact match of template graph G0(V0 E0) (in notation H simG0) if there exists a bijective function φ V0 larrrarr VH with theproperties (Note that φ may not be unique for a given H)

(i) `(φ(q)) = `(q) for all q isin V0 and(ii) forall(q1 q2) isin E0 we have (φ(q1) φ(q2)) isin EH

(iii) forall(v1 v2) isin EH we have (φminus1(v1) φminus1(v2)) isin E0

Intuition for our Solution The algorithms we develop hereiteratively refines vertex-match functions ω(v) sub V0 such thatfor every v isin V ω(v) stores a super-set of all templatevertices v can possibly match Set ω(v) converges to containall possible values of φminus1(v) were v involved in one ormore matching subgraphs When a single constraint involving

TABLE I Symbolic notation usedObject(s) Notationtemplate graph vertices edges G0(V0 E0)template graph sizes n0 = |V0| m0 = |E0|template vertices V0 = q0 q1 qn0minus1template edges (qi qj) isin E0set of vertices adjacent to qi in G0 adj(qi)

background graph vertices edges G(V E)background graph sizes n = |V| m = |E|background vertices V = v0 v1 vnminus1background edges (vi vj) isin Eset of vertices adjacent to vi in G adj(vi)

maximum vertex degree in G dmax

label set L = 0 1 n` minus 1vertex label and label degree of qi `(qi) isin Lmatching subgraph vertices edges H(VH EH)solution subgraph vertices edges Glowast(Vlowast Elowast)vertex match function ω(vi) sub V0set of non-local constraints for G0 K0

q isin V0 is violatedunmet q is no longer a possibility for v ina match and q is removed ω(v)larr ω(v) q

Our previously developed pruning algorithms [15] requiredthat all vertex labels in the template G0 were unique and thatG0 be acyclic or edge-monocyclic to ensure 100 precisionOur new techniques achieve 100 precision for arbitrary G0

Remark 1 Given an ordered sequence of all n0 verticesq1 q2 qn0 sub V0 a simple (although potentially expen-sive) search from v1 isin Vlowast verifies if v1 is in a match withφ(q1) = v1 or not The search lists an ordered sequencev1 v2 vn0

sub Vlowast with φ defined as φ(qk) = vk Searchstep k proposes a new vk checking Def1 (i) and (ii) If allchecks are passed the search accepts vk and moves on to step(k + 1) but terminates if no such vk exists in Vlowast If the fulllist is generated with all label and edge checks passed thenthere exists a H sim G with VH = v1 v2 vn0

We call this Template-Driven Search (TDS) and develop anefficient distributed version in sectIV to apply to the solutionGlowast(Vlowast Elowast) If TDS has been applied successfully then thereare no false positives remaining

III GRAPH PRUNING FOR SCALABLE MATCHING

Our goal is to realize a technique which systematically elimi-nates all the vertices and edges that do not participate in anymatch H sim G0 This approach is motivated by viewing thetemplate G0 as specifying a set of constraints the vertices andedges that participate in a match must meet As a trivial exam-ple any vertex v whose label `(v) is not present in G0 cannotbe present in an exact match A vertex in an exact matchalso needs to have non-eliminated edges to non-eliminatedvertices labeled as prescribed in the adjacency structure of thecorresponding template vertex Local constraints that involve avertex and its neighborhood can be checked by having verticescommunicate their (tentative) template match(s) with theirone-hop neighbors in the solution subgraph Glowast(Vlowast Elowast) (iethe currently pruned background graph) We call this process

Identify leaf vertices with unique label

Identify vertices with

duplicate labelNon-edge

monocyclicCycle constraints to verify

(1) (2) (3) (2) (3)(1) (2) (3) (4)

Path constraints to verify between a pair of vertices that are not part of a cycle and have identical labels

(1)

Identical labels

Template

Sub-template

Step 1 Step 2 Step 3 - Cycle Constraints (CC) Step 4 - Path Constraints (PC) Step 5 - TDS Constraints

Fig 3 A high-level depiction of non-local constraint generation for the template in Fig 1 The figure shows the steps to generate required cycle constraints(CC) path constraints (PC) and higher-order constraints requiring template-drive search (TDS) (Due to limited space the figure presents only a subset of thepath and TDS constraints generated however sufficient to guarantee 100 precision)

LCC1

NLCC - PC NLCC - CC

NLCC - TDS

NLCC - CC

NLCC - TDS

LCC1

LCC1 LCC2 LCC1 LCC2

1Iteration 2 3 4 5

6 7 8 10 119Iteration

Fig 4 Algorithm walk through for the example background graph and template in Fig 1 depicting which vertices and edges in Glowast(Vlowast Elowast) are eliminated(in solid grey) during each iteration The NLCC constraints for G0 are listed in Fig 3 Due to space limitations the example does not show the applicationof some of the constraints in Fig 3 that do not eliminate vertices or edges

Local Constraint Checking (LCC) Our experiments show thatLCC is responsible for removing the bulk of non-matchingvertices and edges

Some classes of templates (with cycles or repeated vertex la-bels) require additional routines to check non-local propertiesand to guarantee that all non-matching vertices are eliminated(Fig 2 highlights the need for these additional checks) Tosupport arbitrary templates we have developed a processwhich we dub Non-local Constraint Checking (NLCC) firstbased on the search template G0 we generate the set ofconstraints K0 that are to be verified then prune the graphusing each of them

Alg 1 presents an overview of our solution This sectionprovides high-level descriptions of the local and non-localconstraint checking routines while sectIV provides the detaileddistributed algorithms For better understanding Fig 4 illus-trates the complete workflow for the graph and pattern in Fig 1for which constraint generation is detailed in Fig 3

Algorithm 1 Main Pruning Loop1 Input background graph G(V E) template G0(V0 E0)2 Output the solution subgraph Glowast(Vlowast Elowast)3 generate non-local constraint set K0 from G0(V0 E0)4 Glowast larr LOCAL CONSTRAINT CHECKING (GG0)5 while K0 is not empty do6 pick and remove next constraint C0 from K0

7 Glowast larr NON LOCAL CONSTRAINT CHECKING (GlowastG0 C0)8 if any vertex has been eliminated or9 has one of its potential matches remove then

10 Glowast larr LOCAL CONSTRAINT CHECKING (GlowastG0)

Local Constraint Checking (LCC) involves a vertex andits neighborhood The algorithm performs the following twooperations (i) Vertex elimination the algorithm excludes thevertices that do not have a corresponding label in the templatethen iteratively excludes the vertices that do not have neigh-bors as labeled in the template For templates that have vertices

with multiple neighbors with the same label the algorithmverifies if a matching vertex in the background graph has aminimum number of distinct neighbors with the same label asprescribed in the template (ii) Edge elimination this excludesedges to eliminated neighbors and edges to neighbors whoselabels do not match the labels prescribed in the adjacencystructure of its corresponding template vertex (eg Fig 4Iteration 1) Edge elimination is crucial for scalability sincein a distributed setting no messages are sent over eliminatededges significantly improving the overall efficiency of thesystem (evaluated in sectV Fig 8)

Non-local Constraint Checking (NLCC) aims to excludevertices that fail to meet topological and label constraintsbeyond the one-hop neighborhood that is covered by LCC(Fig 2) We have identified three types of non-local constraintswhich can be verified independently (i) Cycle Constraints(CC) (ii) Path Constraints (PC) and (iii) constraints thatrequire Template-Driven Search (TDS) (see Remark 1) Forarbitrary templates TDS constraints based on aggregatingmultiple pathscycles enable further pruning and when basedon full template enumeration1 insure that pruning yieldsno false positives Checking TDS constraints however canbe expensive To reduce the overall cost we first generatesingle cycle- and path-based constraints which are usally lesscostly to verify and prune the graph before using TDS (theeffectiveness of this ordering is evaluated in Fig 14(c))

High-level Algorithmic Approach Regardless of constrainttype NLCC leverages a token passing approach tokens areissued by background graph vertices whose correspondingtemplate vertices are identified to have non-local constraintsAfter a fixed number of steps we check if a token has arrived

1In a number of corner cases not all TDS constraints need to be aggregatedfor full precision we skip this discussion for lack of space

where expected (eg back to the originating vertex for check-ing the existence of a cycle) If not then the issuing vertex doesnot satisfy the required constraint and is eliminated Alongthe token path the system verifies that all expected labels areencountered and where necessary uses the path informationaccumulated with the token to verify that differentrepeatednode identity constraint expectations are met Next we discusshow each of the constraints is verified

Cycle Constraints (CC) Higher-order structures within Gthat survive LCC but do not contain G0 are possible if G0contains a cycle (this happens if G contains one or moreunrolled cycles as in Fig 2 Template (a)) To address thiswe directly check for cycles of the correct length

Path Constraints (PC) If the template G0 has two or morevertices with the same label three or more hops away fromeach other then structures in G that survive LCC yet containno match are possible (Fig 2 Template (b)) Thus for everyvertex pair with the same label in G0 we directly checkthe existence of a path of correct length and label sequencefor prospective matching vertices in Glowast Opposite to cyclechecking after a fixed number of steps a token must bereceived by a vertex different from the initiating vertex butwith an identical label

TDS Constraints These are partial or complete (ie in-cluding all edges of the template) walks on the template Thetoken walks the constraint in the background graph and verifiesthat each node visited meets its neighborhood constraints(Remark 1) - in our distributed memory setting this is done bymaintaining a history of the walk and checking that previouslyvisited vertices are revisited as expected TDS constraints arecrucial to guarantee no-false positives for templates that arenon-edge-monocyclic or have repeated labels (Fig 2)

Token Generation For CC and TDS constraints a tokenmust be initiated from each vertex that may participate in thesubstructure whereas for PC tokens are only initiated fromterminal vertices

Further Optimization - Work Aggregation All NLCC con-straints attempt to identify if a walk exists from a specificvertex and though vertices with specific labels Since the goalis to identify the existence of any such path and multipleintermediate paths in the background graph often exist toprevent combinatorial explosion our duplicate work detectionmechanism prevents an intermediately vertex (in the tokenpath) from forwarding a duplicate token (evaluated in sectV-F)

Non-local Constraint Generation For a relatively large classof templates (ie those with unique labels and mono-cyclicedges) LCC and CC are sufficient to generate a precise solu-tion For the general case we generate non-local constraintsusing the following heuristic (Fig 3 shows for a giventemplate the non-local constraints to be verified and Fig 4shows how pruning progresses) First all the leaf vertices withunique labels are identified and ignored from this process (asLCC guarantees pruning if there is no match) Next if thetemplate has cycles then individual cycles are identified (egFig 3 Step 3) and a cycle constraint is generated for eachcycle Next vertices with identical label are identified and all

path constraints are generated for all such pairs (eg Fig 3Step 4 pentagonal vertices)

Finally we identify TDS constraints in three steps Firstfor templates with multiple cycles sharing more than oneedge (ie non-edge-monocyclic) a TDS cyclic constraint isgenerated through union of previously identified cycle con-straints This results in a higher-order cyclic structure with amaximal set of edges that cover all the non-edge-mono cycles(eg Fig 3 Step 5(1)) Second for templates with repeatedlabels a new TDS constraint is generated through a unionof all previously identified path constraints This proceduregenerates higher order structure that covers all the templatevertices with repeated labels (eg Fig 3 Step 5(2)) Thefinal step generates a TDS constraint as the union of thepreviously identified two constraints (eg Fig 3 Step 5(3))(Note that the above is a heuristic more constraints could begenerated by creating various possible combinations of cyclesand paths Only this third step is mandatory to eliminate allfalse positives)

IV ASYNCHRONOUS DISTRIBUTED ALGORITHMS

This section presents the constraint checking algorithms ontop of HavoqGT [23] a MPI-based framework that supportsasynchronous graph algorithms in distributed environmentsOur choice for HavoqGTrsquos is driven by multiple considera-tions first unlike most graph processing frameworks that onlysupport the Bulk Synchronous Parallel (BSP) model HavoqGThas been designed to support asynchronous algorithms essen-tial to achieve high-performance second the framework hasexcellent scaling properties [24] [21] and finally it enablesload balancing HavoqGTrsquos delegate partitioned graph dis-tributes the edges of each high-degree vertices across multiplecompute nodes which is crucial for achieving scalability forscale-free graphs with skewed degree distribution

In HavoqGT graph algorithms are implemented as vertex-callbacks the user-defined visit() callback can only access andupdate the state of a vertex The framework offers the abilityto generate events (aka lsquovisitorsrsquo in HavoqGT lingo) thattrigger this callback - either at the entire graph level usingthe do traversal() method or for a neighboring vertex usingthe push(visitor) call (this enables asynchronous vertex-to-vertex communication) The asynchronous graph computationcompletes when all events have been processed which is de-termined by a distributed quiescence detection algorithm [25]

Alg 1 outlines the key steps of the graph pruning procedureBelow we describe the distributed implementation of the localand non-local constraint checking routines Alg 2 lists thestate maintained at each vertex and its initialization

Local Constraint Checking is implemented as an iterativeprocess (Alg 3 and the corresponding callback Alg 4) Eachiteration initiates an asynchronous traversal by invoking thedo traversal() method and as a result each active vertexreceives a visitor with msgtype = init In the triggered visit()callback if the label of a vertex vj in the graph is a match forthe label of any vertex in the template and the vertex is still

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 2: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

Template (b)Template (a) Template (c)

Fig 2 Three examples of search templates and background graphs thatjustify the full set of pruning constraints Template (a) is a 3-cycle cycles oflength 3k with repeated labels in the background graph meet neighborhoodconstraints surviving LCC Template (b) contains several vertices with non-unique labels to its right there is a background graph that meets individualpoint-to-point path constraints also surviving NLCC-path checking Template(c) is characterized by two 4-cliques that overlap at a 3-cycle the backgroundgraph structure to the right is doubly periodic (a 4times3 torus) and meets all edgeand vertex cycle constraints surviving NLCC-cycle checking Templates (b)and (c) require NLCC-template-driven search to guarantee no false positivestemplate (a) only needs cycle checking in addition to LLC

two types of constraints local and non-local to possibly bepart of a match Local constraints involve only the vertex andits neighborhood a vertex in an exact match needs to (i) matchthe label of a corresponding vertex in the template and (ii)have edges to vertices labeled as prescribed in the adjacencystructure of this corresponding vertex in the template Non-local constraints are topological requirements beyond theimmediate neighborhood of a vertex (eg that the vertex mustbe part of a cycle) We describe how these constraints aregenerated and our algorithmic solution to verify them in sectIII

Third we observe that full match enumeration is notthe most efficient avenue to support many high-level graphanalysis scenarios Depending on the final goal of the userpattern matching problems fall into a number of categorieswhich include (a) determining if a match exists (or not) inthe background graph (yesno answer) (b) selecting all thevertices and edges that participate in matches (c) rankingthese vertices or edges based on their centrality with respectto the search template ie the frequency of their participationmatchs (d) countingestimating the total number of matches(comparable to the well-known triangle counting problem) or(e) enumerating all distinct matches in the background graphThe traditional approach is to perform (e) and to use the resultof the enumeration to answer (a) ndash (d) However this approachis limited to small background graphs or is dependent on alow number of near and exact matches within the backgroundgraph (due to exponential growth of the search state-space)We argue that a pruning-based pipeline is not only a practicalsolution to (a) ndash (d) (and to other pattern-matching-related an-alytics) when full match enumeration is not the main interestbut also an efficient path towards full match enumeration Wedemonstrate that a solution starting from the techniques wedevelop for pruning efficiently supports match enumerationfor two reasons First the pruned graph can be multiple ordersof magnitude smaller than the background graph and existinghigh-complexity enumeration routines (which otherwise wouldbe intractable following the conventional approach) are nowapplicable Second our pruning techniques collect additionalkey information to accelerate match enumeration for eachvertex in the pruned graph our algorithms build a list of itspotential matches in the template (sectV-C))

Contributions We capitalize on our preliminary study [15]that highlighted the effectiveness of pruning (yet in the contextof a restricted set of templates) and design a pattern matching

solution that is generic - no restrictions on the set of patternssupported precise - no false positives offers 100 recall -retrieving all matches efficient - low generated network trafficand scalable - able to process graphs with up to trillions ofedges on tens of thousands of cores (as demonstrated) Inparticular we make the following contributions

(i) Novel Asynchronous Algorithms We have developedasynchronous vertex-centric algorithms able to prune the back-ground graph to a precise enumeration of all vertices andedges that participate in a match for arbitrary templates Thekey gap we bridge is the ability support templates with re-peated vertex labels cycles and arbitrary edge density Impor-tantly the algorithms map well on a distributed asynchronousgraph processing platform thus enabling scalability and high-performance For a given template we propose heuristics togenerate the constraints that are later used for pruning Wehave also developed correctness proofs to show that theseconstraints eliminate all non-matching vertices and offer fullrecall (not included here due to space constraints)

(ii) Optimized Distributed Implementation We offer anefficient implementation of these algorithms on the top ofHavoqGT [21] an open-source asynchronous graph processingframework The prototype includes two key optimizationsthat dramatically reduce the generated traffic aggressive edgeelimination and what we call work aggregation a techniquethat skips duplicate checks in non-local constraint checkingthus preventing possible combinatorial explosion Addition-ally our pruning implementation collects potential matchinginformation not only it prunes away all vertices and edges thatdo not participate in any match but for each of the verticesthat remain it collects their mappings to the search templateWe use this information to accelerate match enumeration

(iii) Proof of Feasibility We demonstrate the applicabilityof this solution by experimenting on real-world and syntheticdatasets orders of magnitude larger than prior work (sectV) Weevaluate scalability through two experiments first a strongscaling experiment using real-world datasets including thelargest openly available webgraph whose undirected versionhas over 257 billion edges second a weak scaling experimentusing synthetic R-MAT [22] graphs of up to 44 trillionedges on up to 1024 compute nodes (36864 cores) Wedemonstrate support for search patterns representative of prac-tical queries in both relatively low selectivity and needle inthe haystack scenarios and to stress our system considerpatterns containing only the highest-frequency vertex labels(up to 14B instances) We show that our technique prunes thegraph by orders of magnitude which combined with usingthe intermediary state generated by pruning makes matchenumeration feasible on graphs with trillions of edges

(iv) Application Demonstration We demonstrate the abilityof our solution to support practical graph analytics queriesTo this end we use two real-world metadata graphs which wehave curated from publicly available datasets Reddit (39Bvertices 14B edges) and the smaller International MovieDatabase (IMDB) and demonstrate practical use cases of ourtechnique to support rich pattern mining (sectV-D)

(v) Exploring Trade-offs and the Impact of Strategic DesignChoices and Optimizations Our approach has the added flexi-bility that search can be stopped early leading to the ability totrade faster time to an approximate solution (or even precisesolution yet without 100 precision guarantees) for precision(rate of false positives in the pruned graph) and precisionguarantees We also explore the impact of each optimizationused (sectV-F) the cumulative impact of these optimizations isa multiple orders of magnitude of runtime reduction bringingpattern matching on massive metadata graphs in the realm ofpossible graph analytics

II PRELIMINARIES

We aim to identify all structures within a large backgroundgraph G identical to a small connected template graph G0We describe general graph properties for G and use the samenotation (summarized in Table I) for other graph objects

A graph G(V E) is a collection of n vertices V =0 1 nminus 1 and m edges (i j) isin E where i j isin V (i isthe edgersquos source and j is the target) Here we only discusssimple (ie no self-edges) undirected vertex-labeled graphsalthough the techniques are applicable to directed non-simplegraphs with labels on both edges and vertices An undirectedG satisfies (i j) isin E if and only if (j i) isin E Vertex irsquosadjacency list adj(i) is the set of all j such that (i j) isin E A vertex-labeled graph also has a set of n` labels L of whicheach vertex i isin V has an assignment `(i) isin L

A walk in G is an ordered subsequence of V where eachconsecutive pair is an edge in E A walk with no repeatedvertices is a path A path with equal first and last vertex is acycle An acyclic graph has no cycles

We discuss several graph objects simultaneously the tem-plate graph G0(V0 E0) the background graph G(V E) andthe current solution subgraph Glowast(Vlowast Elowast) with Vlowast sub V andElowast sub E Our techniques iteratively refine Vlowast and Elowast until theyconverge to the union of all subgraphs of G that exactly matchthe template G0

For clarity when referring to vertices and edges from thetemplate graph G0 we will use the notation qi isin V0 and(qi qj) isin E0 Conversely we will use vi isin V and (vi vj) isin Efor vertices and edges from the background graph G or thesolution subgraph GlowastDefinition 1 A subgraph H(VH EH)VH sub V EH sub E is anexact match of template graph G0(V0 E0) (in notation H simG0) if there exists a bijective function φ V0 larrrarr VH with theproperties (Note that φ may not be unique for a given H)

(i) `(φ(q)) = `(q) for all q isin V0 and(ii) forall(q1 q2) isin E0 we have (φ(q1) φ(q2)) isin EH

(iii) forall(v1 v2) isin EH we have (φminus1(v1) φminus1(v2)) isin E0

Intuition for our Solution The algorithms we develop hereiteratively refines vertex-match functions ω(v) sub V0 such thatfor every v isin V ω(v) stores a super-set of all templatevertices v can possibly match Set ω(v) converges to containall possible values of φminus1(v) were v involved in one ormore matching subgraphs When a single constraint involving

TABLE I Symbolic notation usedObject(s) Notationtemplate graph vertices edges G0(V0 E0)template graph sizes n0 = |V0| m0 = |E0|template vertices V0 = q0 q1 qn0minus1template edges (qi qj) isin E0set of vertices adjacent to qi in G0 adj(qi)

background graph vertices edges G(V E)background graph sizes n = |V| m = |E|background vertices V = v0 v1 vnminus1background edges (vi vj) isin Eset of vertices adjacent to vi in G adj(vi)

maximum vertex degree in G dmax

label set L = 0 1 n` minus 1vertex label and label degree of qi `(qi) isin Lmatching subgraph vertices edges H(VH EH)solution subgraph vertices edges Glowast(Vlowast Elowast)vertex match function ω(vi) sub V0set of non-local constraints for G0 K0

q isin V0 is violatedunmet q is no longer a possibility for v ina match and q is removed ω(v)larr ω(v) q

Our previously developed pruning algorithms [15] requiredthat all vertex labels in the template G0 were unique and thatG0 be acyclic or edge-monocyclic to ensure 100 precisionOur new techniques achieve 100 precision for arbitrary G0

Remark 1 Given an ordered sequence of all n0 verticesq1 q2 qn0 sub V0 a simple (although potentially expen-sive) search from v1 isin Vlowast verifies if v1 is in a match withφ(q1) = v1 or not The search lists an ordered sequencev1 v2 vn0

sub Vlowast with φ defined as φ(qk) = vk Searchstep k proposes a new vk checking Def1 (i) and (ii) If allchecks are passed the search accepts vk and moves on to step(k + 1) but terminates if no such vk exists in Vlowast If the fulllist is generated with all label and edge checks passed thenthere exists a H sim G with VH = v1 v2 vn0

We call this Template-Driven Search (TDS) and develop anefficient distributed version in sectIV to apply to the solutionGlowast(Vlowast Elowast) If TDS has been applied successfully then thereare no false positives remaining

III GRAPH PRUNING FOR SCALABLE MATCHING

Our goal is to realize a technique which systematically elimi-nates all the vertices and edges that do not participate in anymatch H sim G0 This approach is motivated by viewing thetemplate G0 as specifying a set of constraints the vertices andedges that participate in a match must meet As a trivial exam-ple any vertex v whose label `(v) is not present in G0 cannotbe present in an exact match A vertex in an exact matchalso needs to have non-eliminated edges to non-eliminatedvertices labeled as prescribed in the adjacency structure of thecorresponding template vertex Local constraints that involve avertex and its neighborhood can be checked by having verticescommunicate their (tentative) template match(s) with theirone-hop neighbors in the solution subgraph Glowast(Vlowast Elowast) (iethe currently pruned background graph) We call this process

Identify leaf vertices with unique label

Identify vertices with

duplicate labelNon-edge

monocyclicCycle constraints to verify

(1) (2) (3) (2) (3)(1) (2) (3) (4)

Path constraints to verify between a pair of vertices that are not part of a cycle and have identical labels

(1)

Identical labels

Template

Sub-template

Step 1 Step 2 Step 3 - Cycle Constraints (CC) Step 4 - Path Constraints (PC) Step 5 - TDS Constraints

Fig 3 A high-level depiction of non-local constraint generation for the template in Fig 1 The figure shows the steps to generate required cycle constraints(CC) path constraints (PC) and higher-order constraints requiring template-drive search (TDS) (Due to limited space the figure presents only a subset of thepath and TDS constraints generated however sufficient to guarantee 100 precision)

LCC1

NLCC - PC NLCC - CC

NLCC - TDS

NLCC - CC

NLCC - TDS

LCC1

LCC1 LCC2 LCC1 LCC2

1Iteration 2 3 4 5

6 7 8 10 119Iteration

Fig 4 Algorithm walk through for the example background graph and template in Fig 1 depicting which vertices and edges in Glowast(Vlowast Elowast) are eliminated(in solid grey) during each iteration The NLCC constraints for G0 are listed in Fig 3 Due to space limitations the example does not show the applicationof some of the constraints in Fig 3 that do not eliminate vertices or edges

Local Constraint Checking (LCC) Our experiments show thatLCC is responsible for removing the bulk of non-matchingvertices and edges

Some classes of templates (with cycles or repeated vertex la-bels) require additional routines to check non-local propertiesand to guarantee that all non-matching vertices are eliminated(Fig 2 highlights the need for these additional checks) Tosupport arbitrary templates we have developed a processwhich we dub Non-local Constraint Checking (NLCC) firstbased on the search template G0 we generate the set ofconstraints K0 that are to be verified then prune the graphusing each of them

Alg 1 presents an overview of our solution This sectionprovides high-level descriptions of the local and non-localconstraint checking routines while sectIV provides the detaileddistributed algorithms For better understanding Fig 4 illus-trates the complete workflow for the graph and pattern in Fig 1for which constraint generation is detailed in Fig 3

Algorithm 1 Main Pruning Loop1 Input background graph G(V E) template G0(V0 E0)2 Output the solution subgraph Glowast(Vlowast Elowast)3 generate non-local constraint set K0 from G0(V0 E0)4 Glowast larr LOCAL CONSTRAINT CHECKING (GG0)5 while K0 is not empty do6 pick and remove next constraint C0 from K0

7 Glowast larr NON LOCAL CONSTRAINT CHECKING (GlowastG0 C0)8 if any vertex has been eliminated or9 has one of its potential matches remove then

10 Glowast larr LOCAL CONSTRAINT CHECKING (GlowastG0)

Local Constraint Checking (LCC) involves a vertex andits neighborhood The algorithm performs the following twooperations (i) Vertex elimination the algorithm excludes thevertices that do not have a corresponding label in the templatethen iteratively excludes the vertices that do not have neigh-bors as labeled in the template For templates that have vertices

with multiple neighbors with the same label the algorithmverifies if a matching vertex in the background graph has aminimum number of distinct neighbors with the same label asprescribed in the template (ii) Edge elimination this excludesedges to eliminated neighbors and edges to neighbors whoselabels do not match the labels prescribed in the adjacencystructure of its corresponding template vertex (eg Fig 4Iteration 1) Edge elimination is crucial for scalability sincein a distributed setting no messages are sent over eliminatededges significantly improving the overall efficiency of thesystem (evaluated in sectV Fig 8)

Non-local Constraint Checking (NLCC) aims to excludevertices that fail to meet topological and label constraintsbeyond the one-hop neighborhood that is covered by LCC(Fig 2) We have identified three types of non-local constraintswhich can be verified independently (i) Cycle Constraints(CC) (ii) Path Constraints (PC) and (iii) constraints thatrequire Template-Driven Search (TDS) (see Remark 1) Forarbitrary templates TDS constraints based on aggregatingmultiple pathscycles enable further pruning and when basedon full template enumeration1 insure that pruning yieldsno false positives Checking TDS constraints however canbe expensive To reduce the overall cost we first generatesingle cycle- and path-based constraints which are usally lesscostly to verify and prune the graph before using TDS (theeffectiveness of this ordering is evaluated in Fig 14(c))

High-level Algorithmic Approach Regardless of constrainttype NLCC leverages a token passing approach tokens areissued by background graph vertices whose correspondingtemplate vertices are identified to have non-local constraintsAfter a fixed number of steps we check if a token has arrived

1In a number of corner cases not all TDS constraints need to be aggregatedfor full precision we skip this discussion for lack of space

where expected (eg back to the originating vertex for check-ing the existence of a cycle) If not then the issuing vertex doesnot satisfy the required constraint and is eliminated Alongthe token path the system verifies that all expected labels areencountered and where necessary uses the path informationaccumulated with the token to verify that differentrepeatednode identity constraint expectations are met Next we discusshow each of the constraints is verified

Cycle Constraints (CC) Higher-order structures within Gthat survive LCC but do not contain G0 are possible if G0contains a cycle (this happens if G contains one or moreunrolled cycles as in Fig 2 Template (a)) To address thiswe directly check for cycles of the correct length

Path Constraints (PC) If the template G0 has two or morevertices with the same label three or more hops away fromeach other then structures in G that survive LCC yet containno match are possible (Fig 2 Template (b)) Thus for everyvertex pair with the same label in G0 we directly checkthe existence of a path of correct length and label sequencefor prospective matching vertices in Glowast Opposite to cyclechecking after a fixed number of steps a token must bereceived by a vertex different from the initiating vertex butwith an identical label

TDS Constraints These are partial or complete (ie in-cluding all edges of the template) walks on the template Thetoken walks the constraint in the background graph and verifiesthat each node visited meets its neighborhood constraints(Remark 1) - in our distributed memory setting this is done bymaintaining a history of the walk and checking that previouslyvisited vertices are revisited as expected TDS constraints arecrucial to guarantee no-false positives for templates that arenon-edge-monocyclic or have repeated labels (Fig 2)

Token Generation For CC and TDS constraints a tokenmust be initiated from each vertex that may participate in thesubstructure whereas for PC tokens are only initiated fromterminal vertices

Further Optimization - Work Aggregation All NLCC con-straints attempt to identify if a walk exists from a specificvertex and though vertices with specific labels Since the goalis to identify the existence of any such path and multipleintermediate paths in the background graph often exist toprevent combinatorial explosion our duplicate work detectionmechanism prevents an intermediately vertex (in the tokenpath) from forwarding a duplicate token (evaluated in sectV-F)

Non-local Constraint Generation For a relatively large classof templates (ie those with unique labels and mono-cyclicedges) LCC and CC are sufficient to generate a precise solu-tion For the general case we generate non-local constraintsusing the following heuristic (Fig 3 shows for a giventemplate the non-local constraints to be verified and Fig 4shows how pruning progresses) First all the leaf vertices withunique labels are identified and ignored from this process (asLCC guarantees pruning if there is no match) Next if thetemplate has cycles then individual cycles are identified (egFig 3 Step 3) and a cycle constraint is generated for eachcycle Next vertices with identical label are identified and all

path constraints are generated for all such pairs (eg Fig 3Step 4 pentagonal vertices)

Finally we identify TDS constraints in three steps Firstfor templates with multiple cycles sharing more than oneedge (ie non-edge-monocyclic) a TDS cyclic constraint isgenerated through union of previously identified cycle con-straints This results in a higher-order cyclic structure with amaximal set of edges that cover all the non-edge-mono cycles(eg Fig 3 Step 5(1)) Second for templates with repeatedlabels a new TDS constraint is generated through a unionof all previously identified path constraints This proceduregenerates higher order structure that covers all the templatevertices with repeated labels (eg Fig 3 Step 5(2)) Thefinal step generates a TDS constraint as the union of thepreviously identified two constraints (eg Fig 3 Step 5(3))(Note that the above is a heuristic more constraints could begenerated by creating various possible combinations of cyclesand paths Only this third step is mandatory to eliminate allfalse positives)

IV ASYNCHRONOUS DISTRIBUTED ALGORITHMS

This section presents the constraint checking algorithms ontop of HavoqGT [23] a MPI-based framework that supportsasynchronous graph algorithms in distributed environmentsOur choice for HavoqGTrsquos is driven by multiple considera-tions first unlike most graph processing frameworks that onlysupport the Bulk Synchronous Parallel (BSP) model HavoqGThas been designed to support asynchronous algorithms essen-tial to achieve high-performance second the framework hasexcellent scaling properties [24] [21] and finally it enablesload balancing HavoqGTrsquos delegate partitioned graph dis-tributes the edges of each high-degree vertices across multiplecompute nodes which is crucial for achieving scalability forscale-free graphs with skewed degree distribution

In HavoqGT graph algorithms are implemented as vertex-callbacks the user-defined visit() callback can only access andupdate the state of a vertex The framework offers the abilityto generate events (aka lsquovisitorsrsquo in HavoqGT lingo) thattrigger this callback - either at the entire graph level usingthe do traversal() method or for a neighboring vertex usingthe push(visitor) call (this enables asynchronous vertex-to-vertex communication) The asynchronous graph computationcompletes when all events have been processed which is de-termined by a distributed quiescence detection algorithm [25]

Alg 1 outlines the key steps of the graph pruning procedureBelow we describe the distributed implementation of the localand non-local constraint checking routines Alg 2 lists thestate maintained at each vertex and its initialization

Local Constraint Checking is implemented as an iterativeprocess (Alg 3 and the corresponding callback Alg 4) Eachiteration initiates an asynchronous traversal by invoking thedo traversal() method and as a result each active vertexreceives a visitor with msgtype = init In the triggered visit()callback if the label of a vertex vj in the graph is a match forthe label of any vertex in the template and the vertex is still

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 3: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

(v) Exploring Trade-offs and the Impact of Strategic DesignChoices and Optimizations Our approach has the added flexi-bility that search can be stopped early leading to the ability totrade faster time to an approximate solution (or even precisesolution yet without 100 precision guarantees) for precision(rate of false positives in the pruned graph) and precisionguarantees We also explore the impact of each optimizationused (sectV-F) the cumulative impact of these optimizations isa multiple orders of magnitude of runtime reduction bringingpattern matching on massive metadata graphs in the realm ofpossible graph analytics

II PRELIMINARIES

We aim to identify all structures within a large backgroundgraph G identical to a small connected template graph G0We describe general graph properties for G and use the samenotation (summarized in Table I) for other graph objects

A graph G(V E) is a collection of n vertices V =0 1 nminus 1 and m edges (i j) isin E where i j isin V (i isthe edgersquos source and j is the target) Here we only discusssimple (ie no self-edges) undirected vertex-labeled graphsalthough the techniques are applicable to directed non-simplegraphs with labels on both edges and vertices An undirectedG satisfies (i j) isin E if and only if (j i) isin E Vertex irsquosadjacency list adj(i) is the set of all j such that (i j) isin E A vertex-labeled graph also has a set of n` labels L of whicheach vertex i isin V has an assignment `(i) isin L

A walk in G is an ordered subsequence of V where eachconsecutive pair is an edge in E A walk with no repeatedvertices is a path A path with equal first and last vertex is acycle An acyclic graph has no cycles

We discuss several graph objects simultaneously the tem-plate graph G0(V0 E0) the background graph G(V E) andthe current solution subgraph Glowast(Vlowast Elowast) with Vlowast sub V andElowast sub E Our techniques iteratively refine Vlowast and Elowast until theyconverge to the union of all subgraphs of G that exactly matchthe template G0

For clarity when referring to vertices and edges from thetemplate graph G0 we will use the notation qi isin V0 and(qi qj) isin E0 Conversely we will use vi isin V and (vi vj) isin Efor vertices and edges from the background graph G or thesolution subgraph GlowastDefinition 1 A subgraph H(VH EH)VH sub V EH sub E is anexact match of template graph G0(V0 E0) (in notation H simG0) if there exists a bijective function φ V0 larrrarr VH with theproperties (Note that φ may not be unique for a given H)

(i) `(φ(q)) = `(q) for all q isin V0 and(ii) forall(q1 q2) isin E0 we have (φ(q1) φ(q2)) isin EH

(iii) forall(v1 v2) isin EH we have (φminus1(v1) φminus1(v2)) isin E0

Intuition for our Solution The algorithms we develop hereiteratively refines vertex-match functions ω(v) sub V0 such thatfor every v isin V ω(v) stores a super-set of all templatevertices v can possibly match Set ω(v) converges to containall possible values of φminus1(v) were v involved in one ormore matching subgraphs When a single constraint involving

TABLE I Symbolic notation usedObject(s) Notationtemplate graph vertices edges G0(V0 E0)template graph sizes n0 = |V0| m0 = |E0|template vertices V0 = q0 q1 qn0minus1template edges (qi qj) isin E0set of vertices adjacent to qi in G0 adj(qi)

background graph vertices edges G(V E)background graph sizes n = |V| m = |E|background vertices V = v0 v1 vnminus1background edges (vi vj) isin Eset of vertices adjacent to vi in G adj(vi)

maximum vertex degree in G dmax

label set L = 0 1 n` minus 1vertex label and label degree of qi `(qi) isin Lmatching subgraph vertices edges H(VH EH)solution subgraph vertices edges Glowast(Vlowast Elowast)vertex match function ω(vi) sub V0set of non-local constraints for G0 K0

q isin V0 is violatedunmet q is no longer a possibility for v ina match and q is removed ω(v)larr ω(v) q

Our previously developed pruning algorithms [15] requiredthat all vertex labels in the template G0 were unique and thatG0 be acyclic or edge-monocyclic to ensure 100 precisionOur new techniques achieve 100 precision for arbitrary G0

Remark 1 Given an ordered sequence of all n0 verticesq1 q2 qn0 sub V0 a simple (although potentially expen-sive) search from v1 isin Vlowast verifies if v1 is in a match withφ(q1) = v1 or not The search lists an ordered sequencev1 v2 vn0

sub Vlowast with φ defined as φ(qk) = vk Searchstep k proposes a new vk checking Def1 (i) and (ii) If allchecks are passed the search accepts vk and moves on to step(k + 1) but terminates if no such vk exists in Vlowast If the fulllist is generated with all label and edge checks passed thenthere exists a H sim G with VH = v1 v2 vn0

We call this Template-Driven Search (TDS) and develop anefficient distributed version in sectIV to apply to the solutionGlowast(Vlowast Elowast) If TDS has been applied successfully then thereare no false positives remaining

III GRAPH PRUNING FOR SCALABLE MATCHING

Our goal is to realize a technique which systematically elimi-nates all the vertices and edges that do not participate in anymatch H sim G0 This approach is motivated by viewing thetemplate G0 as specifying a set of constraints the vertices andedges that participate in a match must meet As a trivial exam-ple any vertex v whose label `(v) is not present in G0 cannotbe present in an exact match A vertex in an exact matchalso needs to have non-eliminated edges to non-eliminatedvertices labeled as prescribed in the adjacency structure of thecorresponding template vertex Local constraints that involve avertex and its neighborhood can be checked by having verticescommunicate their (tentative) template match(s) with theirone-hop neighbors in the solution subgraph Glowast(Vlowast Elowast) (iethe currently pruned background graph) We call this process

Identify leaf vertices with unique label

Identify vertices with

duplicate labelNon-edge

monocyclicCycle constraints to verify

(1) (2) (3) (2) (3)(1) (2) (3) (4)

Path constraints to verify between a pair of vertices that are not part of a cycle and have identical labels

(1)

Identical labels

Template

Sub-template

Step 1 Step 2 Step 3 - Cycle Constraints (CC) Step 4 - Path Constraints (PC) Step 5 - TDS Constraints

Fig 3 A high-level depiction of non-local constraint generation for the template in Fig 1 The figure shows the steps to generate required cycle constraints(CC) path constraints (PC) and higher-order constraints requiring template-drive search (TDS) (Due to limited space the figure presents only a subset of thepath and TDS constraints generated however sufficient to guarantee 100 precision)

LCC1

NLCC - PC NLCC - CC

NLCC - TDS

NLCC - CC

NLCC - TDS

LCC1

LCC1 LCC2 LCC1 LCC2

1Iteration 2 3 4 5

6 7 8 10 119Iteration

Fig 4 Algorithm walk through for the example background graph and template in Fig 1 depicting which vertices and edges in Glowast(Vlowast Elowast) are eliminated(in solid grey) during each iteration The NLCC constraints for G0 are listed in Fig 3 Due to space limitations the example does not show the applicationof some of the constraints in Fig 3 that do not eliminate vertices or edges

Local Constraint Checking (LCC) Our experiments show thatLCC is responsible for removing the bulk of non-matchingvertices and edges

Some classes of templates (with cycles or repeated vertex la-bels) require additional routines to check non-local propertiesand to guarantee that all non-matching vertices are eliminated(Fig 2 highlights the need for these additional checks) Tosupport arbitrary templates we have developed a processwhich we dub Non-local Constraint Checking (NLCC) firstbased on the search template G0 we generate the set ofconstraints K0 that are to be verified then prune the graphusing each of them

Alg 1 presents an overview of our solution This sectionprovides high-level descriptions of the local and non-localconstraint checking routines while sectIV provides the detaileddistributed algorithms For better understanding Fig 4 illus-trates the complete workflow for the graph and pattern in Fig 1for which constraint generation is detailed in Fig 3

Algorithm 1 Main Pruning Loop1 Input background graph G(V E) template G0(V0 E0)2 Output the solution subgraph Glowast(Vlowast Elowast)3 generate non-local constraint set K0 from G0(V0 E0)4 Glowast larr LOCAL CONSTRAINT CHECKING (GG0)5 while K0 is not empty do6 pick and remove next constraint C0 from K0

7 Glowast larr NON LOCAL CONSTRAINT CHECKING (GlowastG0 C0)8 if any vertex has been eliminated or9 has one of its potential matches remove then

10 Glowast larr LOCAL CONSTRAINT CHECKING (GlowastG0)

Local Constraint Checking (LCC) involves a vertex andits neighborhood The algorithm performs the following twooperations (i) Vertex elimination the algorithm excludes thevertices that do not have a corresponding label in the templatethen iteratively excludes the vertices that do not have neigh-bors as labeled in the template For templates that have vertices

with multiple neighbors with the same label the algorithmverifies if a matching vertex in the background graph has aminimum number of distinct neighbors with the same label asprescribed in the template (ii) Edge elimination this excludesedges to eliminated neighbors and edges to neighbors whoselabels do not match the labels prescribed in the adjacencystructure of its corresponding template vertex (eg Fig 4Iteration 1) Edge elimination is crucial for scalability sincein a distributed setting no messages are sent over eliminatededges significantly improving the overall efficiency of thesystem (evaluated in sectV Fig 8)

Non-local Constraint Checking (NLCC) aims to excludevertices that fail to meet topological and label constraintsbeyond the one-hop neighborhood that is covered by LCC(Fig 2) We have identified three types of non-local constraintswhich can be verified independently (i) Cycle Constraints(CC) (ii) Path Constraints (PC) and (iii) constraints thatrequire Template-Driven Search (TDS) (see Remark 1) Forarbitrary templates TDS constraints based on aggregatingmultiple pathscycles enable further pruning and when basedon full template enumeration1 insure that pruning yieldsno false positives Checking TDS constraints however canbe expensive To reduce the overall cost we first generatesingle cycle- and path-based constraints which are usally lesscostly to verify and prune the graph before using TDS (theeffectiveness of this ordering is evaluated in Fig 14(c))

High-level Algorithmic Approach Regardless of constrainttype NLCC leverages a token passing approach tokens areissued by background graph vertices whose correspondingtemplate vertices are identified to have non-local constraintsAfter a fixed number of steps we check if a token has arrived

1In a number of corner cases not all TDS constraints need to be aggregatedfor full precision we skip this discussion for lack of space

where expected (eg back to the originating vertex for check-ing the existence of a cycle) If not then the issuing vertex doesnot satisfy the required constraint and is eliminated Alongthe token path the system verifies that all expected labels areencountered and where necessary uses the path informationaccumulated with the token to verify that differentrepeatednode identity constraint expectations are met Next we discusshow each of the constraints is verified

Cycle Constraints (CC) Higher-order structures within Gthat survive LCC but do not contain G0 are possible if G0contains a cycle (this happens if G contains one or moreunrolled cycles as in Fig 2 Template (a)) To address thiswe directly check for cycles of the correct length

Path Constraints (PC) If the template G0 has two or morevertices with the same label three or more hops away fromeach other then structures in G that survive LCC yet containno match are possible (Fig 2 Template (b)) Thus for everyvertex pair with the same label in G0 we directly checkthe existence of a path of correct length and label sequencefor prospective matching vertices in Glowast Opposite to cyclechecking after a fixed number of steps a token must bereceived by a vertex different from the initiating vertex butwith an identical label

TDS Constraints These are partial or complete (ie in-cluding all edges of the template) walks on the template Thetoken walks the constraint in the background graph and verifiesthat each node visited meets its neighborhood constraints(Remark 1) - in our distributed memory setting this is done bymaintaining a history of the walk and checking that previouslyvisited vertices are revisited as expected TDS constraints arecrucial to guarantee no-false positives for templates that arenon-edge-monocyclic or have repeated labels (Fig 2)

Token Generation For CC and TDS constraints a tokenmust be initiated from each vertex that may participate in thesubstructure whereas for PC tokens are only initiated fromterminal vertices

Further Optimization - Work Aggregation All NLCC con-straints attempt to identify if a walk exists from a specificvertex and though vertices with specific labels Since the goalis to identify the existence of any such path and multipleintermediate paths in the background graph often exist toprevent combinatorial explosion our duplicate work detectionmechanism prevents an intermediately vertex (in the tokenpath) from forwarding a duplicate token (evaluated in sectV-F)

Non-local Constraint Generation For a relatively large classof templates (ie those with unique labels and mono-cyclicedges) LCC and CC are sufficient to generate a precise solu-tion For the general case we generate non-local constraintsusing the following heuristic (Fig 3 shows for a giventemplate the non-local constraints to be verified and Fig 4shows how pruning progresses) First all the leaf vertices withunique labels are identified and ignored from this process (asLCC guarantees pruning if there is no match) Next if thetemplate has cycles then individual cycles are identified (egFig 3 Step 3) and a cycle constraint is generated for eachcycle Next vertices with identical label are identified and all

path constraints are generated for all such pairs (eg Fig 3Step 4 pentagonal vertices)

Finally we identify TDS constraints in three steps Firstfor templates with multiple cycles sharing more than oneedge (ie non-edge-monocyclic) a TDS cyclic constraint isgenerated through union of previously identified cycle con-straints This results in a higher-order cyclic structure with amaximal set of edges that cover all the non-edge-mono cycles(eg Fig 3 Step 5(1)) Second for templates with repeatedlabels a new TDS constraint is generated through a unionof all previously identified path constraints This proceduregenerates higher order structure that covers all the templatevertices with repeated labels (eg Fig 3 Step 5(2)) Thefinal step generates a TDS constraint as the union of thepreviously identified two constraints (eg Fig 3 Step 5(3))(Note that the above is a heuristic more constraints could begenerated by creating various possible combinations of cyclesand paths Only this third step is mandatory to eliminate allfalse positives)

IV ASYNCHRONOUS DISTRIBUTED ALGORITHMS

This section presents the constraint checking algorithms ontop of HavoqGT [23] a MPI-based framework that supportsasynchronous graph algorithms in distributed environmentsOur choice for HavoqGTrsquos is driven by multiple considera-tions first unlike most graph processing frameworks that onlysupport the Bulk Synchronous Parallel (BSP) model HavoqGThas been designed to support asynchronous algorithms essen-tial to achieve high-performance second the framework hasexcellent scaling properties [24] [21] and finally it enablesload balancing HavoqGTrsquos delegate partitioned graph dis-tributes the edges of each high-degree vertices across multiplecompute nodes which is crucial for achieving scalability forscale-free graphs with skewed degree distribution

In HavoqGT graph algorithms are implemented as vertex-callbacks the user-defined visit() callback can only access andupdate the state of a vertex The framework offers the abilityto generate events (aka lsquovisitorsrsquo in HavoqGT lingo) thattrigger this callback - either at the entire graph level usingthe do traversal() method or for a neighboring vertex usingthe push(visitor) call (this enables asynchronous vertex-to-vertex communication) The asynchronous graph computationcompletes when all events have been processed which is de-termined by a distributed quiescence detection algorithm [25]

Alg 1 outlines the key steps of the graph pruning procedureBelow we describe the distributed implementation of the localand non-local constraint checking routines Alg 2 lists thestate maintained at each vertex and its initialization

Local Constraint Checking is implemented as an iterativeprocess (Alg 3 and the corresponding callback Alg 4) Eachiteration initiates an asynchronous traversal by invoking thedo traversal() method and as a result each active vertexreceives a visitor with msgtype = init In the triggered visit()callback if the label of a vertex vj in the graph is a match forthe label of any vertex in the template and the vertex is still

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 4: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

Identify leaf vertices with unique label

Identify vertices with

duplicate labelNon-edge

monocyclicCycle constraints to verify

(1) (2) (3) (2) (3)(1) (2) (3) (4)

Path constraints to verify between a pair of vertices that are not part of a cycle and have identical labels

(1)

Identical labels

Template

Sub-template

Step 1 Step 2 Step 3 - Cycle Constraints (CC) Step 4 - Path Constraints (PC) Step 5 - TDS Constraints

Fig 3 A high-level depiction of non-local constraint generation for the template in Fig 1 The figure shows the steps to generate required cycle constraints(CC) path constraints (PC) and higher-order constraints requiring template-drive search (TDS) (Due to limited space the figure presents only a subset of thepath and TDS constraints generated however sufficient to guarantee 100 precision)

LCC1

NLCC - PC NLCC - CC

NLCC - TDS

NLCC - CC

NLCC - TDS

LCC1

LCC1 LCC2 LCC1 LCC2

1Iteration 2 3 4 5

6 7 8 10 119Iteration

Fig 4 Algorithm walk through for the example background graph and template in Fig 1 depicting which vertices and edges in Glowast(Vlowast Elowast) are eliminated(in solid grey) during each iteration The NLCC constraints for G0 are listed in Fig 3 Due to space limitations the example does not show the applicationof some of the constraints in Fig 3 that do not eliminate vertices or edges

Local Constraint Checking (LCC) Our experiments show thatLCC is responsible for removing the bulk of non-matchingvertices and edges

Some classes of templates (with cycles or repeated vertex la-bels) require additional routines to check non-local propertiesand to guarantee that all non-matching vertices are eliminated(Fig 2 highlights the need for these additional checks) Tosupport arbitrary templates we have developed a processwhich we dub Non-local Constraint Checking (NLCC) firstbased on the search template G0 we generate the set ofconstraints K0 that are to be verified then prune the graphusing each of them

Alg 1 presents an overview of our solution This sectionprovides high-level descriptions of the local and non-localconstraint checking routines while sectIV provides the detaileddistributed algorithms For better understanding Fig 4 illus-trates the complete workflow for the graph and pattern in Fig 1for which constraint generation is detailed in Fig 3

Algorithm 1 Main Pruning Loop1 Input background graph G(V E) template G0(V0 E0)2 Output the solution subgraph Glowast(Vlowast Elowast)3 generate non-local constraint set K0 from G0(V0 E0)4 Glowast larr LOCAL CONSTRAINT CHECKING (GG0)5 while K0 is not empty do6 pick and remove next constraint C0 from K0

7 Glowast larr NON LOCAL CONSTRAINT CHECKING (GlowastG0 C0)8 if any vertex has been eliminated or9 has one of its potential matches remove then

10 Glowast larr LOCAL CONSTRAINT CHECKING (GlowastG0)

Local Constraint Checking (LCC) involves a vertex andits neighborhood The algorithm performs the following twooperations (i) Vertex elimination the algorithm excludes thevertices that do not have a corresponding label in the templatethen iteratively excludes the vertices that do not have neigh-bors as labeled in the template For templates that have vertices

with multiple neighbors with the same label the algorithmverifies if a matching vertex in the background graph has aminimum number of distinct neighbors with the same label asprescribed in the template (ii) Edge elimination this excludesedges to eliminated neighbors and edges to neighbors whoselabels do not match the labels prescribed in the adjacencystructure of its corresponding template vertex (eg Fig 4Iteration 1) Edge elimination is crucial for scalability sincein a distributed setting no messages are sent over eliminatededges significantly improving the overall efficiency of thesystem (evaluated in sectV Fig 8)

Non-local Constraint Checking (NLCC) aims to excludevertices that fail to meet topological and label constraintsbeyond the one-hop neighborhood that is covered by LCC(Fig 2) We have identified three types of non-local constraintswhich can be verified independently (i) Cycle Constraints(CC) (ii) Path Constraints (PC) and (iii) constraints thatrequire Template-Driven Search (TDS) (see Remark 1) Forarbitrary templates TDS constraints based on aggregatingmultiple pathscycles enable further pruning and when basedon full template enumeration1 insure that pruning yieldsno false positives Checking TDS constraints however canbe expensive To reduce the overall cost we first generatesingle cycle- and path-based constraints which are usally lesscostly to verify and prune the graph before using TDS (theeffectiveness of this ordering is evaluated in Fig 14(c))

High-level Algorithmic Approach Regardless of constrainttype NLCC leverages a token passing approach tokens areissued by background graph vertices whose correspondingtemplate vertices are identified to have non-local constraintsAfter a fixed number of steps we check if a token has arrived

1In a number of corner cases not all TDS constraints need to be aggregatedfor full precision we skip this discussion for lack of space

where expected (eg back to the originating vertex for check-ing the existence of a cycle) If not then the issuing vertex doesnot satisfy the required constraint and is eliminated Alongthe token path the system verifies that all expected labels areencountered and where necessary uses the path informationaccumulated with the token to verify that differentrepeatednode identity constraint expectations are met Next we discusshow each of the constraints is verified

Cycle Constraints (CC) Higher-order structures within Gthat survive LCC but do not contain G0 are possible if G0contains a cycle (this happens if G contains one or moreunrolled cycles as in Fig 2 Template (a)) To address thiswe directly check for cycles of the correct length

Path Constraints (PC) If the template G0 has two or morevertices with the same label three or more hops away fromeach other then structures in G that survive LCC yet containno match are possible (Fig 2 Template (b)) Thus for everyvertex pair with the same label in G0 we directly checkthe existence of a path of correct length and label sequencefor prospective matching vertices in Glowast Opposite to cyclechecking after a fixed number of steps a token must bereceived by a vertex different from the initiating vertex butwith an identical label

TDS Constraints These are partial or complete (ie in-cluding all edges of the template) walks on the template Thetoken walks the constraint in the background graph and verifiesthat each node visited meets its neighborhood constraints(Remark 1) - in our distributed memory setting this is done bymaintaining a history of the walk and checking that previouslyvisited vertices are revisited as expected TDS constraints arecrucial to guarantee no-false positives for templates that arenon-edge-monocyclic or have repeated labels (Fig 2)

Token Generation For CC and TDS constraints a tokenmust be initiated from each vertex that may participate in thesubstructure whereas for PC tokens are only initiated fromterminal vertices

Further Optimization - Work Aggregation All NLCC con-straints attempt to identify if a walk exists from a specificvertex and though vertices with specific labels Since the goalis to identify the existence of any such path and multipleintermediate paths in the background graph often exist toprevent combinatorial explosion our duplicate work detectionmechanism prevents an intermediately vertex (in the tokenpath) from forwarding a duplicate token (evaluated in sectV-F)

Non-local Constraint Generation For a relatively large classof templates (ie those with unique labels and mono-cyclicedges) LCC and CC are sufficient to generate a precise solu-tion For the general case we generate non-local constraintsusing the following heuristic (Fig 3 shows for a giventemplate the non-local constraints to be verified and Fig 4shows how pruning progresses) First all the leaf vertices withunique labels are identified and ignored from this process (asLCC guarantees pruning if there is no match) Next if thetemplate has cycles then individual cycles are identified (egFig 3 Step 3) and a cycle constraint is generated for eachcycle Next vertices with identical label are identified and all

path constraints are generated for all such pairs (eg Fig 3Step 4 pentagonal vertices)

Finally we identify TDS constraints in three steps Firstfor templates with multiple cycles sharing more than oneedge (ie non-edge-monocyclic) a TDS cyclic constraint isgenerated through union of previously identified cycle con-straints This results in a higher-order cyclic structure with amaximal set of edges that cover all the non-edge-mono cycles(eg Fig 3 Step 5(1)) Second for templates with repeatedlabels a new TDS constraint is generated through a unionof all previously identified path constraints This proceduregenerates higher order structure that covers all the templatevertices with repeated labels (eg Fig 3 Step 5(2)) Thefinal step generates a TDS constraint as the union of thepreviously identified two constraints (eg Fig 3 Step 5(3))(Note that the above is a heuristic more constraints could begenerated by creating various possible combinations of cyclesand paths Only this third step is mandatory to eliminate allfalse positives)

IV ASYNCHRONOUS DISTRIBUTED ALGORITHMS

This section presents the constraint checking algorithms ontop of HavoqGT [23] a MPI-based framework that supportsasynchronous graph algorithms in distributed environmentsOur choice for HavoqGTrsquos is driven by multiple considera-tions first unlike most graph processing frameworks that onlysupport the Bulk Synchronous Parallel (BSP) model HavoqGThas been designed to support asynchronous algorithms essen-tial to achieve high-performance second the framework hasexcellent scaling properties [24] [21] and finally it enablesload balancing HavoqGTrsquos delegate partitioned graph dis-tributes the edges of each high-degree vertices across multiplecompute nodes which is crucial for achieving scalability forscale-free graphs with skewed degree distribution

In HavoqGT graph algorithms are implemented as vertex-callbacks the user-defined visit() callback can only access andupdate the state of a vertex The framework offers the abilityto generate events (aka lsquovisitorsrsquo in HavoqGT lingo) thattrigger this callback - either at the entire graph level usingthe do traversal() method or for a neighboring vertex usingthe push(visitor) call (this enables asynchronous vertex-to-vertex communication) The asynchronous graph computationcompletes when all events have been processed which is de-termined by a distributed quiescence detection algorithm [25]

Alg 1 outlines the key steps of the graph pruning procedureBelow we describe the distributed implementation of the localand non-local constraint checking routines Alg 2 lists thestate maintained at each vertex and its initialization

Local Constraint Checking is implemented as an iterativeprocess (Alg 3 and the corresponding callback Alg 4) Eachiteration initiates an asynchronous traversal by invoking thedo traversal() method and as a result each active vertexreceives a visitor with msgtype = init In the triggered visit()callback if the label of a vertex vj in the graph is a match forthe label of any vertex in the template and the vertex is still

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 5: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

where expected (eg back to the originating vertex for check-ing the existence of a cycle) If not then the issuing vertex doesnot satisfy the required constraint and is eliminated Alongthe token path the system verifies that all expected labels areencountered and where necessary uses the path informationaccumulated with the token to verify that differentrepeatednode identity constraint expectations are met Next we discusshow each of the constraints is verified

Cycle Constraints (CC) Higher-order structures within Gthat survive LCC but do not contain G0 are possible if G0contains a cycle (this happens if G contains one or moreunrolled cycles as in Fig 2 Template (a)) To address thiswe directly check for cycles of the correct length

Path Constraints (PC) If the template G0 has two or morevertices with the same label three or more hops away fromeach other then structures in G that survive LCC yet containno match are possible (Fig 2 Template (b)) Thus for everyvertex pair with the same label in G0 we directly checkthe existence of a path of correct length and label sequencefor prospective matching vertices in Glowast Opposite to cyclechecking after a fixed number of steps a token must bereceived by a vertex different from the initiating vertex butwith an identical label

TDS Constraints These are partial or complete (ie in-cluding all edges of the template) walks on the template Thetoken walks the constraint in the background graph and verifiesthat each node visited meets its neighborhood constraints(Remark 1) - in our distributed memory setting this is done bymaintaining a history of the walk and checking that previouslyvisited vertices are revisited as expected TDS constraints arecrucial to guarantee no-false positives for templates that arenon-edge-monocyclic or have repeated labels (Fig 2)

Token Generation For CC and TDS constraints a tokenmust be initiated from each vertex that may participate in thesubstructure whereas for PC tokens are only initiated fromterminal vertices

Further Optimization - Work Aggregation All NLCC con-straints attempt to identify if a walk exists from a specificvertex and though vertices with specific labels Since the goalis to identify the existence of any such path and multipleintermediate paths in the background graph often exist toprevent combinatorial explosion our duplicate work detectionmechanism prevents an intermediately vertex (in the tokenpath) from forwarding a duplicate token (evaluated in sectV-F)

Non-local Constraint Generation For a relatively large classof templates (ie those with unique labels and mono-cyclicedges) LCC and CC are sufficient to generate a precise solu-tion For the general case we generate non-local constraintsusing the following heuristic (Fig 3 shows for a giventemplate the non-local constraints to be verified and Fig 4shows how pruning progresses) First all the leaf vertices withunique labels are identified and ignored from this process (asLCC guarantees pruning if there is no match) Next if thetemplate has cycles then individual cycles are identified (egFig 3 Step 3) and a cycle constraint is generated for eachcycle Next vertices with identical label are identified and all

path constraints are generated for all such pairs (eg Fig 3Step 4 pentagonal vertices)

Finally we identify TDS constraints in three steps Firstfor templates with multiple cycles sharing more than oneedge (ie non-edge-monocyclic) a TDS cyclic constraint isgenerated through union of previously identified cycle con-straints This results in a higher-order cyclic structure with amaximal set of edges that cover all the non-edge-mono cycles(eg Fig 3 Step 5(1)) Second for templates with repeatedlabels a new TDS constraint is generated through a unionof all previously identified path constraints This proceduregenerates higher order structure that covers all the templatevertices with repeated labels (eg Fig 3 Step 5(2)) Thefinal step generates a TDS constraint as the union of thepreviously identified two constraints (eg Fig 3 Step 5(3))(Note that the above is a heuristic more constraints could begenerated by creating various possible combinations of cyclesand paths Only this third step is mandatory to eliminate allfalse positives)

IV ASYNCHRONOUS DISTRIBUTED ALGORITHMS

This section presents the constraint checking algorithms ontop of HavoqGT [23] a MPI-based framework that supportsasynchronous graph algorithms in distributed environmentsOur choice for HavoqGTrsquos is driven by multiple considera-tions first unlike most graph processing frameworks that onlysupport the Bulk Synchronous Parallel (BSP) model HavoqGThas been designed to support asynchronous algorithms essen-tial to achieve high-performance second the framework hasexcellent scaling properties [24] [21] and finally it enablesload balancing HavoqGTrsquos delegate partitioned graph dis-tributes the edges of each high-degree vertices across multiplecompute nodes which is crucial for achieving scalability forscale-free graphs with skewed degree distribution

In HavoqGT graph algorithms are implemented as vertex-callbacks the user-defined visit() callback can only access andupdate the state of a vertex The framework offers the abilityto generate events (aka lsquovisitorsrsquo in HavoqGT lingo) thattrigger this callback - either at the entire graph level usingthe do traversal() method or for a neighboring vertex usingthe push(visitor) call (this enables asynchronous vertex-to-vertex communication) The asynchronous graph computationcompletes when all events have been processed which is de-termined by a distributed quiescence detection algorithm [25]

Alg 1 outlines the key steps of the graph pruning procedureBelow we describe the distributed implementation of the localand non-local constraint checking routines Alg 2 lists thestate maintained at each vertex and its initialization

Local Constraint Checking is implemented as an iterativeprocess (Alg 3 and the corresponding callback Alg 4) Eachiteration initiates an asynchronous traversal by invoking thedo traversal() method and as a result each active vertexreceives a visitor with msgtype = init In the triggered visit()callback if the label of a vertex vj in the graph is a match forthe label of any vertex in the template and the vertex is still

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 6: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

Algorithm 2 Vertex State and Initialization1 status of vertex vj α(vj)larr true (active) if existq isin V0 st `(vj) = `(q)

otherwise false (ie vj has been pruned)2 set of possible matches in template for vertex vj ω(vj) larr initially allqk isin V0 st `(qk) = `(vj)

3 map of active edges of vertex vj ε(vj)larr keys are initialized to adj(vj)The value field which is initially empty is set to ω(vi) for each vi isin ε(vj)that has communicated its state to vj

4 set of already forwarded tokens by vertex vj τ(vj) larr initially emptyused for work aggregation in NLCC

active it creates visitors for all its active neighbors in ε(vj)with msgtype = alive (Alg 4 line 9) When a vertex vj isvisited with msgtype = alive it verifies whether the sendervertex vs satisfies one of its own (vjrsquos) template constraintsby invoking the function η(vs vj) By the end of an iterationif vj satisfies all the template constraints ie it has neighborswith required labels (and if needed a minimum of distinctneighbors with the same label as prescribed in the template)it stays active (ie α(vj) = true) for the next iteration Fortemplates that have multiple vertices with the same label inany iteration a vertex with that label in the background graphcould match any of these vertices in the template so eachmach must be verified independently If vj fails to satisfy therequired constraints for a template vertex qk isin ω(vj) qk isremoved from ω(vj) At any stage if ω(vj) becomes emptythen vj is marked inactive (α(vj)larr false) and never createsvisitors again Edge elimination excludes two categories ofedges first the edges to neighbors vi isin ε(vj) from which vjdid not receive an alive message and second the edges toneighbors whose labels do not match the labels prescribed inthe adjacency structure of the corresponding template vertex(s)in ω(vj) A vertex vj is also marked inactive if its active edgelist ε(vj) becomes empty Iterations continue until no vertexandor edge is marked inactive

Algorithm 3 Local Constraint Checking1 η(vs vj) - tests if vs satisfies a local constraint of vj returns ω(s)2 if constraints are satisfied empty otherwise3 procedure LOCAL CONSTRAINT CHECKING (GG0)4 do5 do traversal(msgtype larr init)6 barrier7 for all vj isin V do8 ωprime larr empty set of matches in template for neighbors of vj9 for all vi isin ε(vj) do

10 if η(vi vj) = empty then11 ε(vj)remove(vi) edge eliminated12 continue13 else14 ωprime larr ωprime cup η(vi vj) accum matches of the nbrs15 reset the value field of vi isin ε(vj) for the next iteration16 for all qk isin ω(vj) do for each potential match17 if adj(qk) ωprime then18 qk does not meet neighbor requirements19 ω(vj)remove(qk) remove from potential matches20 continue21 if ε(vj) = empty or ω(vj) = empty then22 α(vj)larr false vertex eliminated23 while vertices or edges are eliminated global detection

Algorithm 4 Local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state vs - vertex that originated the visitor3 visitor state ω(vs) - set of possible matches in template for vertex vs4 visitor state msgtype - init or alive5 procedure VISIT(G vq) vq - visitor queue6 if α(vj) = false then return7 if msgtype = init then8 for all vi isin ε(vj) do9 vis larr LCC VISITOR(vi vj ω(vj) alive)

10 vqpush(vis)

11 else if msgtype = alive then12 ε(vj)get(vs)larr ω(vs)

Non-local Constraint Checking routine iterates over K0 theset of non-local constraints to be checked and validates eachC0 isin K0 one at a time Alg 5 describes the solution to verify asingle constraint tokens are initiated through an asynchronoustraversal by invoking the do traversal() method Each activevertex vj isin G that is a potential match for the vertex q0 at thehead of a path C0 broadcasts a token to all its active neighborsin ε(vj) A map γ is used to track these token issuers A tokenis a tuple (t r) where t is an ordered list of vertices that haveforwarded the token and r is the hop-counter t0 isin t is thetoken-issuing vertex in G The ordered list t is essential forTDS since it enables detection of distinct vertices with thesame label in the token path

Algorithm 5 Non-local Constraint Checking1 procedure NON LOCAL CONSTRAINT CHECKING(GG0 C0)2 γ larr map of token source vertices (in G) for C0 the value field3 (initialized to false) is set to true if the token source vertex meets4 the requirements of C05 do traversal(msgtype larr init)6 barrier7 for all vj isin γ do8 if γget(vj) 6= true then9 violates C0 eliminate potential match

10 ω(vj)remove(q0) where q0 is the first vertex in C011 if ω(vj) = empty then no potential match left12 α(vj)larr false vertex eliminated13 forallvj isin V reset τ(vj)

When an active vertex vj receives a token with msgtype =forward it verifies that if ω(vj) is a match for the next entryin C0 if it has received the token from a valid neighbor (withrespect to entries in C0) and that the current hop count is lt|C0| If these requirements are satisfied (ie micro returns true) vjsets itself as the forwarding vertex (added to t) increments thehop count and broadcasts the token to all its active neighborsin ε(vj) If any of the constraints are not met vj drops thetoken If the hop count r is equal to |C0| and vj is the same asthe source vertex in the token for a cyclic template a path hasbeen found and vj is marked true in γ For path constraintsan acknowledgement is sent to the token issuer to update itsstatus in γ (Alg 6 lines 28 ndash 31) Once verification ofa constraint C0 has been completed the vertices that are notmarked true in γ are invalidated ie α(vj)larr false (Alg 5line 12) NLCC uses an unordered set τ(vj) (Alg 2 line 4)for work aggregation (see Alg 6 line 14) at each vertexthis is used to detect if another copy of token has alreadyvisited the vertex vj taking a different path

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 7: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

Algorithm 6 Non-local Constraint Checking Visitor1 visitor state vj - vertex that is visited2 visitor state token - the token is a tuple (t r) where t is an ordered list

of vertices that have forwarded the token and r is the hop-counter t0 isin tis the vertex that originated the token

3 visitor state msgtype - init forward or ack4 micro(vj C0 token) - tests if vj satisfies requirements of C0 for the current

state of token returns true if constraints are met false otherwise5 procedure VISIT(G vq)6 if α(vj) = false then return7 if msgtype = init and existqk isin ω(vj) where qk = q0 isin C0 then8 initiate a token vj is the token source9 tadd(vj) r larr 1 tokenlarr (t r) γinsert(vj false)

10 for all vi isin ε(vj) do11 vis larr NLCC VISITOR(vi token forward)12 vqpush(vis)

13 else if msgtype = forward then vj received a token14 if token isin τ(vj) then work aggregation optimization15 τ(vj)insert(token)16 else return ignore token if it was previously forwarded by vj17 if micro(vj C0 token) = true and tokenr lt |C0| then18 the walk can be extended with vj and it has not yet reached

|C0| length19 tokentadd(vj) tokenr larr tokenr + 120 for all vi isin ε(vj) do forward the token21 vis larr NLCC VISITOR(vi token forward)22 vqpush(vis)

23 else if micro(vj C0 token) = true and tokenr = |C0| then24 the walk can be extended with vj and it has reached

|C0| length25 if C0 is cyclic and t0 = vj then26 γget(vj)larr true return vj meets requirements of C027 else if C0 is acyclic and t0 6= vj then28 vis larr NLCC VISITOR(t0 token ack)29 vqpush(vis) send ack to the token originator t0 isin t30 else if msgtype = ack then31 γget(vj)larr true return vj meets requirements of C0

Termination Output and Match Enumeration Queries IfNLCC is not required the search terminates when no vertexis eliminated (or none of its potential matches is removed)in an LCC iteration Otherwise the search terminates whenall constraints in K0 have been verified and no vertex iseliminated (or none of its potential matches is removed) inthe following LCC phase The output is (i) the set of verticesand edges that survived the iterative elimination process and(ii) for each vertex in this set the mapping in the templatewhere a match has been identified A distributed enumerationor counting routine can operate on the pruned graph withthis information Alg 6 can be slightly modified to obtainthe enumeration of the matches in the background graph theconstraint used is the full template work aggregation is turnedoff and each possible match is verified

Metadata Store Metadata is stored independent of the graphtopology itself (which uses CSR format [26]) At initializationonly the required attributes are read from the file(s) stored ona distributed file system A light-weight distributed processbuilds the in-memory (or memory-mapped) metadata store On256 nodes for the 257 billion edge Web Data Commons graph[27] the metadata store can be built in under two minutesAlthough in this paper we consider vertex metadata (ielabels) only edge metadata is also supported

V EVALUATION

We present strong (sectV-B) and weak (sectV-A) scaling experi-ments of pruning on massive real-world and synthetic graphsadditionally we demonstrate full match enumeration startingfrom the pruned graph (sectV-C) we evaluate the effective-ness of the optimizations our system incorporates (sectV-F) wehighlight the use of our system in the context of realisticdata analytics scenarios (sectV-D) we explore time-to-solutionvs precisionguarantees trade-offs (sectV-E) and finally wecompare our solution with a recent work QFrag [28] (sectV-H)

Testbed The testbed is the 26PFlop Quartz cluster at theLawrence Livermore National Lab comprised of 2634 nodesand the Intel Omni-Path interconnect Each node has two 18-core Intel Xeon E5-2695v4 210GHz processors and 128GBof memory [29] We run one MPI process per core

Datasets We summarize the main characteristics of thedatasets used for evaluation and explain how we have gener-ated vertex labels where necessary For all graphs we createdundirected versions of the graphs two directed edges are usedto represent each undirected edge

The datasets used for evaluationType |V| 2|E| dmax

Web Data Commons [27] Real 35B 257B 95MReddit [30] Real 39B 14B 19MInternet Movie Database [31] Real 5M 29M 552KPatent [28] Real 27M 28M 789Youtube [28] Real 46M 88M 25KR-MAT up to Scale 37 [22] Synthetic 137B 44T 612M

Web Data Commons (WDC) graph is a web-graph whosevertices are webpages and edges are hyperlinks To createvertex labels we extract the top-level domain names fromthe webpage URL eg org or edu If the URL containsa common second-level domain name it is chosen over thetop-level domain name For example for oxacuk ac isselected as the vertex label A total of 2903 unique labelsare distributed among the 35B graph vertices We curatedthe Reddit social-media graph from an open archive [30]of billions of public posts and comments from RedditcomReddit allows its users to rate (upvote or downvote) othersrsquoposts and comments The graph has four types of verticesAuthor Post Comment and Subreddit (a category for posts)For the Post and Comment there are three possible labelsPositive Negative and Neutral (indicating the overall balanceof positive and negative votes) or No rating An edge ispossible between an Author and a Post an Author and aComment a Subreddit and a Post a Post and a Comment(to that post) and between two Comments that have a parent-child relationship The International Movie Database (IMDB)graph was curated from the publicly available data [31] Thegraph has five types of vertices Movie Genre Actress Actorand Director An edge is only possible between a Movie typevertex and a non-Movie type vertex We use the smaller Patentand YouTube graphs to compare published results by Serafiniet al [28] The synthetic R-MAT graphs exhibit approximatepower-law degree distribution These graphs were createdfollowing the Graph 500 [32] standards 2Scale vertices and an

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 8: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

2

4

3 7

5

3 7

TreeChain

7354723

Fig 5 Chain and Tree patterns used Both patterns have two pairs of verticeswith the same (numeric) label hence require non-local constraint checking(NLCC) more precisely path checking The labels used are the most frequentin the R-MAT graphs and cover sim30 of all the vertices in the graphs

Chain

28

620

5

16

020

5

15

50

56

5

41

990

6

16

34

58

5

499

182

6

86

35

0 7

15

08

95

6 6

2|E|

Tree

58

74 6

64

68 6

33

728

6

10

712

6

531

82 6

111

786

6

306

36 7

325

990

6PF (10x)

Compute nodes

R-MAT scale

0

9

18

27

36

45

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34 1

28

35 2

56

36 5

12

37

10

24

Tim

e (s

)

0

10

20

30

40

28

2

29

4

30

8

31 1

6

32 3

2

33 6

4

34

12

8

35

25

6

36

51

2

37

102

4

109

225

6 5

36

36

04

7

165

71 4

93

05 5

896

27 5

244

03 5

94

85

9 5

289

312

4

50

26

3 6

876

37

0 5

|V| 33

35

5

37

04

5

19

02

5 5

61

57

6

30

35

6 5

63

48

2 5

17

66

8 6

18

57

21 5PF (10x)

633

889

5

20

77

29 5

635

350

636

977

3 5

256

77

0 7

147

96

8 5

LCC NLCC

Fig 6 Runtime and pattern selectivity for weak scaling experiments brokendown to individual iterations for the chain (left) and tree (right) patternspresented in Fig 5 The X-axis labels present the R-MAT scale and the nodecount used for the experiment (Each node hosts two processors each with 18cores) The number of vertices and edges in each pruned solution is shownon the top of respective bar plot The pruning factors (PF) ie the orderof magnitude reduction in number of verticesedges compared to the originalbackground graph are also shown A flat line indicates perfect weak scalingTime for LLC and NLCC phases is presented with different colors

undirected edge factor of 16 For example a Scale 30 graphhas |V| = 230 and |E| asymp 32 times 230 (as we create a directedversion) We leverage degree information to create vertexlabels computed using the formula `(vi) = dlog2(d(vi)+1)e

Search Templates To stress our system we use templatesbased on patterns naturally occurring and relatively frequentin the template graphs The R-MAT (Fig 5) and WDC (Fig 7)patterns include vertex labels that are among the most frequentin the respective graphs The Reddit and IMDB patterns(Fig 11) include most of the vertex labels in these two graphsWe chose templates to exercise different constraint checkingscenarios the search templates have repeated vertex metadataand non-edge-monocyclic properties

Experimental Methodology All runtime numbers providedare averages over 10 runs For weak scaling experimentswe do not present scaling numbers for a single node asthis experiment does not involve network communication andbenefits from data locality For strong scaling experimentsthe smallest experiment uses 64 nodes as this is the lowestnumber of nodes that can load the graph in memory

A Weak Scaling Experiments

To evaluate the ability to process massive graphs we useweak scaling experiments and the synthetic R-MAT graphsup to Scale 37 (sim44T edges) and up to 1024 nodes (36864cores) Fig 5 shows the two search pattern used and Fig 6presents the runtimes Since there are multiple vertices in the

WDC-1 WDC-3WDC-2

info

net edugovde

ac

org

gov edu

net

biz

info

mil ac

ca

gov edu

edu

edu

biz

deit

ac acmil

Fig 7 WDC patterns using topsecond-level domain names as labels Thelabels selected are among the most frequent covering sim22 of the verticesin the WDC graph org covers sim220M vertices the 2nd most frequent aftercom and mil is the least frequent among these labels coveringsim153K vertices

WDC-1 WDC-2 WDC-347232 117 81913|V|

020406080

100120

64

12

8

25

6

51

2

10

24

0

10

20

30

40

64

12

8

25

6

51

2

10

24

0

17

34

51

68

85

102

119

136

153

170

64

12

8

25

6

51

2

10

24

126212 546 2550222|E|

VE Only

VE and EE

0

10

20

30

40

50

64

12

8

25

6

51

2

10

24

40

1130

2220

3310

4400

64

12

8

25

6

51

2

10

24

LCC NLCC

Tim

e (s

)

Compute nodes

15x

21x27x

18x

35x

66x

18x

29x36x

112x 60x

31x

Fig 8 Runtime for strong scaling experiments broken down to individualphases (LCC and NLCC are in different colors) for the three patterns presentedin Fig 7 The top row of X-axis labels represent the number of compute nodes(Each node hosts two processors each with 18 cores) The last two rows arenumber of vertices and edges in the pruned graph respectively For bettervisibility for WDC-1 (left plots) runtime for different iterations are split intotwo scales on the Y-axis LCC and NLCC-path constraints are at the bottomand LCC and NLCC-TDS constraints are at the top To highlight the impactof edge elimination for WDC-3 (right plots) we present results with edgeelimination disabled (top) and enabled (bottom) Speedup over the 64 nodeconfiguration is also shown on the top of each stacked bar plot

pattern with identical labels (at more than two-hop distance)the patterns require NLCC - path checking - to ensure no falsepositives in the pruned graph We see steady scaling all theway to the Scale 37 graph which has sim44 trillion edgeson 1024 nodes (36864 cores) Runtime is broken down tothe individual iteration to evaluate scaling and the individualcontribution of each intermediate step As a graph gets prunedthe subsequent iterations require less time Fig 6 includes (atthe top of each bar) the final number of vertices and edgesthat participate in the respective patterns Note that the NLCCphases (needed to guarantee a precise solution) do not deleteany vertex or edge hence no further LCC phase is invoked

B Strong Scaling Experiments

Fig 8 shows the runtimes for strong scaling experiments whenusing the real-world WDC graph on up to 1024 nodes (36864cores) Intuitively pattern matching on the WDC graph isharder than on the R-MAT graphs as the WDC graph is bothdenser and has a more skewed degree distribution We usethe patterns presented in Fig 7 WDC-1 is acyclic yet hasmultiple vertices with the same label and thus requires non-local constraint checking (PC and TDS) For better visibilitythe plot splits checking initial LCC and NLCC-path constraints(bottom left) from NLCC-TDS constraints (top left) We noticenear perfect scaling for the LCC phases however some of

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 9: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

21 83 24Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

100E+12

WDC-1 WDC-2 WDC-3

Nu

mb

er o

f ac

tive

ver

tice

s an

d e

dge

s (

log-

scal

e) Edges in the WDC graph

Vertices in the WDC graph

Edges

Vertices

Fig 9 Evolution of the number of active vertices and edges after each iterationfor the same experiments as in Fig 8 The bottom row of X-axis labelsrepresent number of iterations required for a precise solution Note that theY-axis is on log scale

the NLCC phases do not show linear scaling (explained insectV-G) WDC-2 is an example of a pattern with multiple cyclessharing edges and relies on CC and TDS constraint checkingto guarantee no false positive matches WDC-2 shows near-linear scaling with sim13 of the total time spent in the firstLCC phase and little time spent in the NLCC phases WDC-3 is a monocyclic template and when edge elimination isused (bottom right) shows near linear scaling for both LCCand NLCC phases The top right plot highlights the keyperformance impact of edge elimination without it the NLCCphases take almost one order of magnitude longer and theentire pruning 2ndash9times longer

TABLE II Match enumeration statistics Number of matches for the chainand tree patterns (Fig 5 top table) and WDC (Fig 7) Reddit and IMDB(Fig 11) patterns (bottom table) and the enumeration times starting from thepruned graphs Note that for WDC-1 and WDC-3 due to the extremely largenumber of matches (over half billion in each case) we stop enumeration early

R-MAT Compute Chain TreeScale Nodes Count Time (s) Count Time (s)

28 2 2716 1036 1186 103831 16 3747 1054 1488 104034 128 7529 1128 3766 112837 1024 55710 1010 32532 553

Pattern WDC-1 WDC-2 WDC-3 RDT-1 RDT-2 IMDB-1Count 668M 2444 149B 24K 518K 168MTime 4min 184s 40h+ 678s 485s 10h

Compute Nodes 64 64 16 64 64 8

C Match Enumeration

As our technique prunes the graph by orders of magnitude (seeFig 9 and 12(b)) match counting and full match enumerationare now feasible Table II (top) lists the number of distinctmatches and the time to enumerate the chain and tree patternson some of the R-MAT graphs we used While these resultsprove that our match enumeration routines scale well thematch counts and enumeration time for WDC Reddit andIMDB patterns listed on Table II (bottom) are more revealing

There are three important takeaways First while our matchenumeration technique is able to enumerate an immensenumber of matches (see for example results for WDC-1 andWDC-3 with 500+ million matches or even IMDB-1 with15+ million matches) presenting results as pruned vertexedge

Fig 10 The WDC-2 pattern matches in the background graph The numberof matches in each of the six connected components are also shown

sets (with less than 100000 vertices) avoids the combinatorialexplosion and makes is feasible to imagine further analyticsSecond as Fig 10 clearly shows presenting the results as theunion of all matches (rather than explicit match enumeration)is not only more space efficient but also in some cases eveneasy to directly understand by a human analyst Finally whilewe omit details due to lack of space we note that key tosupporting match enumeration is edge pruning this reducesthe edge density in the pruned WDC graph by a factor of10ndash15times (compared to using vertex pruning alone)

D Example Use Case Social Network Analysis

We demonstrate the ability of our scalable pattern matchingtechnique to support complex data analytics scenarios in thecontext of social networks Todayrsquos user experience on socialmedia platforms is tainted by the existence of maliciousactors such as bots trolls and spammers This highlights theimportance of detecting unusual activity patterns that mayindicate potential malicious attacks We present three usecases one for the IMDB graph and two queries that attemptto uncover suspicious activity in the Reddit dataset

Fig 11 summarizes the scenarios we target and presents thecorresponding search patterns Fig 12 shows runtime for thesescenarios broken down to individual LCC and NLCC iterationlevels Although RDT-1 is much less frequent than RDT-2 onthe same 64 nodes pruning for RDT-1 takes more than 3timeslonger to complete as it spends more time verifying the non-local constraints Although both patterns have a 6-cycle RDT-2 allows verification of the two smaller cycles in isolation(For NLCC a longer path typically results in larger generatedmessage traffic) IMDB-1 on 8 nodes spends the majority ofthe time verifying non-local specifically TDS constraints

E Pruning PrecisionGuarantees vs Time-to-Solution

Our approach gradually refines Glowast(Vlowast Elowast) down to the com-plete set of vertices and edges that participate in at least onematch and guarantees no false positives Given that this isan iterative process it is natural to investigate at what rateGlowast is refined and whether there are opportunities to tradebetween the precision (or existence of precision guarantees)of an intermediary solution and compute time

Fig 13 shows the evolution of the precision of the interme-diate solution over time for the various patterns (We defineprecision as the ratio between the number of distinct verticesthat participate in at least one match and the size of refined

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 10: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

C+

P-

A

A

S

C-

P+

S

C+

A

A

C-

P+

C-

Actress

Director

MovieActor

Genre

Genre

Movie

RDT-1 IMDB-1RDT-2

Author CommentPost Subreddit

+ More Up votes - More Down votes

Fig 11 The scenarios and their corresponding templates for the Reddit(RDT) and IMDB graphs RDT-1 (left) identify users with adversarialpostercommenter relationship Each author makes at least two posts or twocomments respectively Comments to posts that with more upvotes (P+)have a balance of negative votes (C-) and comments to posts with moredownvotes (P-) have a positive balance (C+) The posts must be underdifferent Subreddits (category) RDT-2 (center) identify all postercommenterpairs where the commenter makes at least two comments to the same postone directly to the post and one in response to a comment The poster alsomakes a comment in response to a comment The commenter always receivesnegative rating (C-) to a popular post (P+) however comments (to the samepost) by the poster has a positive rating (C+) IMDB-1 (right) find all theactresses actors and directors that worked together at least on two differentmovies that fall under at least two similar genres

RDT-1 RDT-2 IMDB-157194 2614547 104322|V|

0

130

260

390

520

650

780

910

1040

1170

1300

0

50

100

150

200

250

300

350

400

195160 7163856 6398502|E|

LCC NLCC

Tim

e (s

)

0

40

80

120

160

200

240

280

320

(a)

105 123Iterations

100E+00

100E+01

100E+02

100E+03

100E+04

100E+05

100E+06

100E+07

100E+08

100E+09

100E+10

100E+11

RDT-1 RDT-2

Edges in the Reddit graph

Vertices in the Reddit graph

Edges

Vertices

(b)Fig 12 (a) Runtime for the graph analytics patterns presented in Fig 11The labels on X-axis represent the number of vertices and edges in thepruned graph Note that Y-axes have different scales (b) Number of activevertices and edges after each iteration for the same experiments for Reddit asin (a) The labels on bottom row of X-axis represent the number of iterationsrequired Note that the Y-axis is on log scale

vertex set Vlowast at the end of an iteration) We note that (i) therate at which precision improves is pattern dependent and (ii)for some patterns even after precision reaches 100 and nomore vertices are pruned the algorithm continues to operate tooffer guarantees that no false positives are left For example inFig 13(a) WDC-1 quickly reaches 100 precision however99 of the execution time is spent to verify five non-localconstraints (top left WDC-1 plot in Fig 8) to guaranteeno false positive match WDC-3 however shows differentbehaviour the complex structure does not reach 90 precisionuntil the very end and converges to 100 quickly afterwardsWe believe that the rate at which precision is achieved is partlyinfluenced by the order in which constraints are verified andour heuristics for constraint verification ordering leave roomfor improvement

F Impact of Design Decisions and Strategic Optimizations

Edge Elimination Fig 8 (right plots) show scalabilityand performance gains as a result of enabling edge elimi-nation Without edge elimination the WDC-3 pattern resultsin 3180678 edges selected (some are false positives) Edgeelimination identifies the true positive matches and reducesthe number of active edges to 255022 In other words thegraph is 125times sparser which in turn improves overall messageefficiency of the system

0

20

40

60

80

100

0

17

34

51

68

85

10

2

11

9

13

6

15

3

17

0

Pre

cisi

on

(

)

Timeline (s)

WDC-1WDC-2WDC-3

(a) WDC

0

20

40

60

80

100

0 5

10

15

20

25

30

35

40

45

50

Pre

cisi

on

(

)

Timeline (s)

Scale 28 Chain

Scale 37 Chain

Scale 28 Tree

Scale 37 Tree

(b) R-MATFig 13 Vertex set precision over the lifetime of an execution for (a) WDC(Fig 7) and (b) R-MAT (Fig 5) patterns The X-axis presents the timelinewhile the Y-axis is the precision achieved by the end of an iteration Themarkers indicate the moment in time when 100 precision has been achievedThe presented timeline for the WDC-1 is limited to 170th second for bettervisibility (WDC-1 achieves 100 precision in less than 20 seconds) For theR-MAT patterns we only show numbers for Scale 28 and 37

0

2000

4000

6000

8000

10000

12000

WDC-1 RDT-1

Tim

e (s

)

SynchronousAsynchronous

(a)

0

50

100

150

200

WDC-1 WDC-2 WDC-3

Tim

e (

s)

No aggregation

Aggregation

(b)

0

100

200

300

400

WDC-2

Tim

e (s

)

0

1200

2400

3600

4800

6000

7200

RDT-1

TDS-onlyAll constraints

(c)

0

2000

4000

6000

8000

10000

12000

14000

64 128

Tim

e (s

)

NLBLB

0

300

600

900

1200

1500

1800

2100

64WDC-1 RDT-1

31x

13x

17x

Compute nodes

(d)Fig 14 (a) Comparing synchronous and asynchronous NLCC (b) Impact ofwork aggregation on runtime for the WDC patterns (for the sake of readabilityonly a subset of non-local constraints are considered for WDC-1) (c) Runtimeperformance when only TDS constraints are used vs all NLCC constraintsused (Here RDT-1 does not finish after two hours) Note that in (a) and(c) we did not apply load balancing to RDT-1 All experiments in (a) (b)and (c) use 64 compute nodes (d) Impact of load balancing on runtime forthe the WDC-1 and RDT-1 patterns We compare two cases without loadbalancing (NLB) and with load balancing (LB) For WDC-1 we show resultsfor two scales on 64 and 128 nodes Speedup achieved by LB over NLB isalso shown on the top of each bar

Asynchronous Communication Our system is designed toharness the advantages of an asynchronous graph processingframework yet a synchronous one could easily support thesame algorithms Fig 14(a) shows runtime for two patternsthat benefit the most from asynchronicity for WDC-1 andRDT-1 (265times and 35times gains respectively) compared with asynchronous version that adds a barrier after each NLCC tokenpropagation step Asynchronous NLCC makes it possible forall walks to progress independently without synchronizationoverheads Synchronous NLCC is implemented within Havo-qGT as well

Work Aggregation Fig 14(b) shows the performance gainsenabled by the work aggregation strategy employed by NLCC(presented in sectIII and Alg 6) The magnitude of the gain 10-50 is data dependent and more pronounced when the patternis abundant eg WDC-1 has 600M+ instances (Table II)

Constraint Selection For patterns for which full TDS is re-quired for precision guarantees the path and cycle constraints

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 11: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

are there for performance optimizations (their goal is to pruneaway the non-matching part of the graph early) and we areinterested to evaluate the impact of these optimizations To thisend we compare time-to-solution for a configuration that gen-erates and uses all NLCC constraints and one that uses onlythe TDS constraints required to guarantee 100 precisionOur experiments show that although it increases the numberof iterations verifying simpler and smaller substructures firstis extremely effective for some patterns (eg RDT-1) thesystem is not able to complete in reasonable time withoutthese constraints for others (eg WDC-2) these constraintsenable a 239times speedup (Fig 14(c))

G Load Balancing

Load imbalance issues are inherent to problems involvingirregular data structures such as graphs when these needto be partitioned for processing over multiple nodes Forour pattern matching solution load imbalance can be furthercaused by two artifacts first over the course of execution oursolution causes the workload to mutate (ie we prune awayvertices and edges) and second by nonuniform distribution ofmatches in the background graph the vertices and edges thatparticipate in the matches may reside on a small potentiallyconcentrated portion of the graph Load imbalance can indeedoccur for example for the relatively rare WDC-2 pattern whenusing 128 nodes for example the vertices that participate inthe final selection are distributed over as few as 114 partitionsout of 4608 The distribution is concentrated 90 of thematching edges are on 85 partitions while more than halfof of the matching edges are located on only 15 partitionsFor the more frequent WDC-1 pattern 99 of the matchingvertices are part of a single connected component 50 of thematching edges are on less than 5 of the total partitions ona 64 node deployment which becomes less than 3 on a 128node deployment

We employ a pseudo-dynamic load balancing strategyFirst we checkpoint the current state of execution the prunedgraph ie the set of active vertices and edges and the per-vertex state indicating template matches ω(vj) in Alg 2Next using HavoqGTs graph partitioning module we reshufflevertex-to-processor assignment to evenly distribute verticesand edges across processing cores Processing is then resumedon the rebalanced workload (Note that depending on the sizethe the pruned graph it is possible to resumed processingon a smaller deployment) Over the course of the executioncheckpointing and rebalancing can be repeated as needed

As a proof of feasibility to examine the impact of this tech-nique we analyze the runs for WDC-1 and RDT-1 patterns(We chose some of the real-world workloads as they are morelikely to lead to imbalance than synthetically generated load)Fig 14(d) compares performance of the pruning algorithmswith and without load balancing For these examples weperform workload rebalancing only once for WDC-1 beforeverifying the TDS constraints and for RDT-1 when the prunedgraph is four orders of magnitude smaller The extent ofload imbalance is more severe for WDC-1 on the smaller

64 node deployment compared to using 128 nodes workloadrebalncing improves runtime by 31times and 13times on 64 and 128nodes respectively In the case of RDT-1 the gain in runtimeas a results of load balancing is 17times Given the pruned graphsare orders magnitude smaller then the original graph checkpointing rebalancing and relaunching the computation takesless than two minutes which is negligible compared to the gainachieved in time-to-solution In Table II we run enumerationfor WDC-1 WDC-2 and RDT-1 on the rebalanced graphs

TABLE III Performance comparison between QFrag and our pattern matchingsystem The table shows runtime in seconds for full enumeration for QFragpruning and full enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared) For PruneJuice we split time-to-solution into pruning (top row)and enumeration (bottom row) times We use the same graphs (Patent andYourTube) and some of the query templates (Q4 ndash Q7) used for evaluationof QFrag in [28]

QFrag PruneJuice-distributed PruneJuice-sharedPatent Youtube Patent Youtube Patent Youtube

Q4 419 8080238 0704 0100 04000223 1143 0010 0010

Q6 599 10260874 2340 0070 17300065 0301 0005 0010

Q7 636 11890596 1613 0130 08200039 0180 0005 0010

Q8 1005 14480959 2633 0100 13700049 0738 0001 0010

H Comparison with Existing Work QFrag

We empirically compare our work with a recent work(2017) on pattern matching from the database communityQFrag [28]

Similar to our solution QFrag targets exact pattern match-ing yet there are two main differences QFrag assumes thatthe entire graph fits in the memory of each compute nodeand uses data replication to enable search parallelism Moreimportantly QFrag employs a sophisticated load balancingstrategy between parallel instances of a search to achieve goodscalability QFrag is implemented on top of Apache Sparkand Giraph [19] In QFrag each replica runs an instance of apattern enumeration algorithms called TurboISO [33] Throughevaluation the authors demonstrated QFragrsquos performance ad-vantages over two other distributed pattern matching systems(i) TriAD [34] an MPI-based distributed RDF [35] enginebased on an asynchronous distributed join algorithm and (ii)GraphFrames [36] [37] a graph processing library for ApacheSpark also based on distributed join operations

Given that we have demonstrated the good scalability of oursolution (Serafini et al [28] demonstrate equally good scalabil-ity properties for QFrag yet on much smaller graphs) we areinterested to establish a comparison baseline at single nodescale To this end we run experiments on a modern sharedmemory machine with 60 CPU-cores and use four templatequeries and two real-world graphs (Patent and Youtube) thatwere used for evaluation of QFrag [28] We run QFrag with 60threads and HavoqGT with 60 MPI processes The results are

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 12: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

summarized in Table III QFrag runtimes for match enumer-ation (first pair of columns) are comparable with the resultspresented in [28] so we have reasonable confidence that wereplicate their experiments well With respect to combinedpruning and enumeration time our system (second pair ofcolumns presenting pruning and enumeration time separately)is consistently faster than QFrag on all the graphs for all thequeries We note that our solution does not take advantage ofshared memory of the machine at the algorithmic or imple-mentation level (we use different processes one MPI processper core) and has the system overhead of MPI-communicationbetween processes Additionally unlike QFrag our system isnot handicapped by the memory limit of a single machine asit supports graph partitioning To highlight the effectivenessof our technique and get some intuition on the magnitudeof the MPI overheads in this context we implemented ourtechnique for shared memory and present runtimes (whenusing 60 threads) for the same set of experiments in Table III(right two columns) We notice up to an order of magnitudebetter performance the main cost for our technique comparedto the distributed implementation running on a single node

In summary our system works about 4ndash10times faster thanQFrag and if excluding distributed system memory overheadsand considering the pruning time for the shared memorysolution and conservatively reusing enumeration runtime forthe distributed solution it is about 6ndash100times faster than QFrag

VI LIMITATIONS DISCUSSION

We categorize the limitations of our proposed system basedon their respective sources

Limitations stemming from major design decisions Oursystem inherits limitations of a system that performs exactmatching (compared to a system that focuses on approximatematching eg based on graph simulation [7]) Similarlyour system inherits all limitations of its communication andmiddleware infrastructure MPI and HavoqGT One exampleis the lack of flow control in these infrastructures whichsometimes leads to message buildup and system collapse

Limitations stemming from the targeted uses cases In thesame vein we note that our system targets a graph analyticsscenario (queries that need to cover the entire graph) ratherthan the traditional graph database queries that attempt to finda specific pattern around a vertex indicated by the user (whereother systems may perform better)

Limitations stemming from attempting to design a genericsystem Systems optimized for specific patterns may performbetter (eg systems optimized to enumerate triangles ortreelets [39] or systems relying on multi-join indices to supportpatterns with limited diameter)

Limitations stemming from incomplete understanding andwork in progress While we propose heuristics that appear towork well for our experiments one of the key challenges isidentifying an optimal set of constraints and their execution or-der We believe graph statistics at different stages in executioncan be used to dynamically select effective constraints

VII RELATED WORK

The volume of related work on graph processing in gen-eral [46] [19] [20] [47] [48] [49] and on pattern match-ing algorithms in particular [5] [11] [12] [50] [6] [1] ishumbling We summarize closely related work in Table IV

Sequential Algorithms Early work on graph pattern match-ing mainly focused on solving the problem of graph isomor-phism [5] The well-known Ullmannrsquos algorithm [5] and itsextensions (in terms of join order and pruning strategies)eg VF2 [11] and QuickSI [51] belong to the family oftree-search based algorithms A recent effort TurboISO [33]is considered to be the state-of-the-art of tree-search basedsequential subgraph isomorphism algorithm For large graphsa tree search that fails mid-way and has to backtrack canbe expensive Efficient distributed implementation of this ap-proach is difficult due to the costs associated with maintaininglarge intermediate search state across multiple physical nodesthat participate in the search Perhaps the best known exactmatching algorithm that does not belong to the family of tree-search based algorithms is Nauty due to McKay [12] whichis based on canonical labeling of the background graph Thisapproach however has high preprocessing overhead

Subgraph Indexing In the same spirit as database index-ing indexing of frequent subgraph structures is an approachattempted by some in order to reduce the number of join op-erations and lower query response time eg SpiderMine [50]and R-Join [52] Unfortunately for a billion-edge graph thisapproach is infeasible First searching frequent subgraphs ina large graph is expensive Second depending on topologyof the template(s) and the background graph the size of theindex is often superlinear relative to the size of the graph [43]

Distributed Solutions Here we focus on projects thatprovide distributed pattern matching and demonstrate itat some scale Table IV summarizes the key differentiat-ing aspects and the scale achieved The best scale is of-fered by Plantengarsquos [38] MapReduce implementation ofthe walk-based algorithm for inexact matching originallyproposed in [6] Unlike ours Plantengarsquos system can findexactapproximate matches only for a restricted class of smallpatterns Plantega demonstrated performance using a 107billion edge graph the largest-scale experiment to date (ex-cluding ours) SAHAD [39] is a MapReduce implementationof the color-coding algorithm originally developed for findingtree-like patterns (treelet) in protein-protein interaction net-works SAHAD follows a hierarchical sub-template explore-join approach Its application was presented only on smallgraphs with up to sim300M edges FASICA [40] is also acolor-coding-based system for approximate treelet countingwhose MPI-based implementation offers superior performanceto SAHAD Chakaravarthy et al [41] extended the color-coding algorithm to count patterns with cycles and presented aMPI-based distributed implementation However they demon-strated performance on graphs with only a few million edgesAlthough QFrag outperforms many of its competitors in termsof time-to-solution it replicates the entire graph in the memory

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 13: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

TABLE IV Comparison of past work on distributed pattern matching The table highlights the characteristics of the solution presented (exact vs approximatematching) its implementation infrastructure and summarizes the details of the largest-scale experiment performed We highlight the fact that our solution isunique in terms of demonstrated scale ability to perform exact matching and ability to retrieve all matches

Contribution ModelFramework Match Max-Query

MetadataCompute Max-Real Max-Synthetic

Language Type Size Nodes Graph GraphPlantenga [38] Graph Walk Hadoop Approx 4-cliques Real 64 107B edges R-MAT Scale 20QFrag [28] Tree-based Spark Exact 7 edges Real 10 117M edges NASAHAD [39] Explore-Join Hadoop Approx 12 vertices Synthetic 40 NA 269M edgesFASICA [40] Explore-Join MPI Approx 12 vertices NA 15 117M edges Erdos-Renyi 1M edgesChakaravarthy etal [41] Explore-Join MPI Approx 10 vertices NA BGQ-512 27M edges R-MATPGXDAsync [14] Async DFS JavaC++ Exact 4 edges Synthetic 32 NA 2B edges (Unif rand)Gao et al [42] Explore-Join Giraph Approx 50 vertices Synthetic 28 37B edges NASun et al [43] Explore-Join CNet4 Exact 15 vertices Synthetic 12 165M edges 4B verticesMa et al [44] Graph Simulation Python Approx 15 vertices Type only 16 51M edges 100M verticesFard et al [45] Graph Simulation GPS Approx NA NA 8 300M edges NA

of each node which limits its applicability to small graphs onlyPGXDAsync [14] relies on asynchronous depth-first traversaland incorporates flow control with a deterministic guaranteeof search completion under a finite amount of memory BothQFrag and PGXDAsync are demonstrated at a much smallerscale (in terms of graph sizes and number of compute nodes)than in this paper Sun et al [43] present an exact subgraphmatching algorithm which follows the explore-join approachand demonstrate it on large synthetic graphs and larger querygraphs than in [38]

Approximate Matching Recently a new family of approxi-mate matching algorithms based on graph simulation [53] hasbeen proposed [1] [45] [54] As opposed to exact matchinggraph simulation algorithms relax matching constraints egmatching based on vertex attributes and their connectivityconstraints in the query [7] Simulation based algorithms havequadraticcubic time-complexity and have been proposed asa possible solution for emerging matching problems whenlarge-scale graphs are involved [53] Two inexact matchingalgorithms based on graph simulation are introduced in [45][44] although results from both are only presented on rela-tively small real-world graphs Gao et al introduce anotherapproximate matching algorithm based on explore-join [42]and evaluate it on even larger query patterns than in [43] Herea query pattern is converted into a single-sink Directed AcyclicGraph (DAG) and message transition follows its topology

VIII CONCLUSION

This paper presents a new algorithmic pipeline to supportpattern matching on large-scale metadata graphs on largedistributed-memory machines We capitalize on the idea ofgraph pruning and develop asynchronous algorithms that useboth vertex and edge elimination to iteratively prune theoriginal graph and reduce it to a subgraph which represents theunion of all matches We have developed pruning techniquesthat guarantee a solution with 100 precision (ie no falsepositives in the final pruned graph) and 100 recall (ieall vertices and edges participating in matches are included)for arbitrary search patterns including patterns with repeatedvertex metadata and patterns that have arbitrary cycles Ouralgorithms are vertex-centric and asynchronous thus they map

well onto existing high-performance graph frameworks Ourevaluation using up to 257 billion edge real-world web-graphsand up to 44 trillion edge synthetic R-MAT graphs on up to1024 nodes (36864 cores) confirms the scalability of our so-lution We demonstrate that depending on the search templateour approach prunes the graph by orders of magnitude whichenables full pattern enumeration and counting on graphs withtrillions of edges Our success stems from a number of keydesign ingredients asynchronicity aggressive vertex and edgeelimination while harnessing massive parallelism intelligentwork aggregation to ensure low message overhead effectivepruning constraints and lightweight per-vertex state

ACKNOWLEDGEMENT

This work was performed in part under the auspices ofthe US Department of Energy by Lawrence LivermoreNational Laboratory under contract DEAC52-07NA27344(LLNL-CONF-735657) Experiments were performed at theLivermore Computing facility

REFERENCES

[1] W Fan X Wang and Y Wu ldquoDiversified top-k graph patternmatchingrdquo Proc VLDB Endow vol 6 no 13 pp 1510ndash1521 Aug2013 [Online] Available httpdxdoiorg101477825362582536263

[2] N Alon P Dao I Hajirasouliha F Hormozdiari and S C SahinalpldquoBiomolecular network motif counting and discovery by color codingrdquoBioinformatics vol 24 no 13 pp i241ndashi249 Jul 2008 [Online]Available httpdxdoiorg101093bioinformaticsbtn163

[3] Z Zeng J Wang L Zhou and G Karypis ldquoOut-of-core coherentclosed quasi-clique mining from large dense graph databasesrdquo ACMTrans Database Syst vol 32 no 2 Jun 2007 [Online] Availablehttpdoiacmorg10114512425241242530

[4] D CONTE P FOGGIA C SANSONE and M VENTO ldquoThirtyyears of graph matching in pattern recognitionrdquo International Journalof Pattern Recognition and Artificial Intelligence vol 18 no 03 pp265ndash298 2004 [Online] Available httpwwwworldscientificcomdoiabs101142S0218001404003228

[5] J R Ullmann ldquoAn algorithm for subgraph isomorphismrdquo JACM vol 23 no 1 pp 31ndash42 Jan 1976 [Online] Availablehttpdoiacmorg101145321921321925

[6] J W Berry B Hendrickson S Kahan and P Konecny ldquoSoftware andalgorithms for graph queries on multithreaded architecturesrdquo in 2007IEEE International Parallel and Distributed Processing SymposiumMarch 2007 pp 1ndash14

[7] W Fan J Li S Ma N Tang Y Wu and Y Wu ldquoGraphpattern matching From intractable to polynomial timerdquo Proc VLDBEndow vol 3 no 1-2 pp 264ndash275 Sep 2010 [Online] Availablehttpdxdoiorg101477819208411920878

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 14: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

[8] H Tong C Faloutsos B Gallagher and T Eliassi-Rad ldquoFastbest-effort pattern matching in large attributed graphsrdquo in Proceedingsof the 13th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining ser KDD rsquo07 New York NY USAACM 2007 pp 737ndash746 [Online] Available httpdoiacmorg10114512811921281271

[9] A Ching S Edunov M Kabiljo D Logothetis and S MuthukrishnanldquoOne trillion edges Graph processing at facebook-scalerdquo Proc VLDBEndow vol 8 no 12 pp 1804ndash1815 Aug 2015 [Online] Availablehttpdxdoiorg101477828240322824077

[10] S A Myers A Sharma P Gupta and J Lin ldquoInformation networkor social network The structure of the twitter follow graphrdquo inProceedings of the 23rd International Conference on World Wide Webser WWW rsquo14 Companion New York NY USA ACM 2014pp 493ndash498 [Online] Available httpdoiacmorg10114525679482576939

[11] L P Cordella P Foggia C Sansone and M Vento ldquoA (sub)graphisomorphism algorithm for matching large graphsrdquo IEEE Trans PatternAnal Mach Intell vol 26 no 10 pp 1367ndash1372 Oct 2004[Online] Available httpdxdoiorg101109TPAMI200475

[12] B D Mckay and A Piperno ldquoPractical graph isomorphism iirdquo JSymb Comput vol 60 pp 94ndash112 Jan 2014 [Online] Availablehttpdxdoiorg101016jjsc201309003

[13] C H C Teixeira A J Fonseca M Serafini G Siganos M J Zakiand A Aboulnaga ldquoArabesque A system for distributed graph miningrdquoin Proceedings of the 25th Symposium on Operating Systems Principlesser SOSP rsquo15 New York NY USA ACM 2015 pp 425ndash440[Online] Available httpdoiacmorg10114528154002815410

[14] N P Roth V Trigonakis S Hong H Chafi A Potter B Motikand I Horrocks ldquoPgxdasync A scalable distributed graph patternmatching enginerdquo in Proceedings of the Fifth International Workshopon Graph Data-management Experiences amp Systems ser GRADESrsquo17New York NY USA ACM 2017 pp 71ndash76 [Online] Availablehttpdoiacmorg10114530784473078454

[15] T Reza C Klymko M Ripeanu G Sanders and R Pearce ldquoTowardspractical and robust labeled pattern matching in trillion-edge graphsrdquo in2017 IEEE International Conference on Cluster Computing (CLUSTER)Sept 2017 pp 1ndash12

[16] A Lulli E Carlini P Dazzi C Lucchese and L Ricci ldquoFast connectedcomponents computation in large graphs by vertex pruningrdquo IEEETransactions on Parallel and Distributed Systems vol 28 no 3 pp760ndash773 March 2017

[17] F Zhou S Mahler and H Toivonen Simplification ofNetworks by Edge Pruning Berlin Heidelberg SpringerBerlin Heidelberg 2012 pp 179ndash198 [Online] Availablehttpsdoiorg101007978-3-642-31830-6 13

[18] A Kusum K Vora R Gupta and I Neamtiu ldquoEfficient processingof large graphs via input reductionrdquo in Proceedings of the 25th ACMInternational Symposium on High-Performance Parallel and DistributedComputing ser HPDC rsquo16 New York NY USA ACM 2016 pp 245ndash257 [Online] Available httpdoiacmorg10114529072942907312

[19] ldquoGiraphrdquo 2016 [Online] Available httpgiraphapacheorg[20] J E Gonzalez Y Low H Gu D Bickson and C Guestrin

ldquoPowergraph Distributed graph-parallel computation on natural graphsrdquoin Proceedings of the 10th USENIX Conference on OperatingSystems Design and Implementation ser OSDIrsquo12 Berkeley CAUSA USENIX Association 2012 pp 17ndash30 [Online] Availablehttpdlacmorgcitationcfmid=23878802387883

[21] R Pearce M Gokhale and N M Amato ldquoFaster parallel traversalof scale free graphs at extreme scale with vertex delegatesrdquo inProceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo14 PiscatawayNJ USA IEEE Press 2014 pp 549ndash559 [Online] Availablehttpsdoiorg101109SC201450

[22] D Chakrabarti Y Zhan and C Faloutsos ldquoR-mat A recursive modelfor graph miningrdquo in Proceedings of the Fourth SIAM Int Conf onData Mining Society for Industrial Mathematics 2004 p p 442

[23] ldquoHavoqgtrdquo 2016 [Online] Available httpsoftwarellnlgovhavoqgt[24] R Pearce M Gokhale and N M Amato ldquoScaling techniques

for massive scale-free graphs in distributed (external) memoryrdquo inProceedings of the 2013 IEEE 27th International Symposium onParallel and Distributed Processing ser IPDPS rsquo13 Washington DCUSA IEEE Computer Society 2013 pp 825ndash836 [Online] Availablehttpdxdoiorg101109IPDPS201372

[25] M P Wellman and W E Walsh ldquoDistributed quiescence detection inmultiagent negotiationrdquo in Proceedings Fourth International Conferenceon MultiAgent Systems 2000 pp 317ndash324

[26] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Proceedings ofthe Conference on High Performance Computing Networking Storageand Analysis ser SC rsquo09 New York NY USA ACM 2009 pp 181ndash1811 [Online] Available httpdoiacmorg10114516540591654078

[27] O L Robert Meusel Christian Bizer ldquoWeb data commons -hyperlink graphsrdquo 2016 [Online] Available httpwebdatacommonsorghyperlinkgraphindexhtml

[28] M Serafini G De Francisci Morales and G Siganos ldquoQfragDistributed graph search via subgraph isomorphismrdquo in Proceedingsof the 2017 Symposium on Cloud Computing ser SoCC rsquo17 NewYork NY USA ACM 2017 pp 214ndash228 [Online] Availablehttpdoiacmorg10114531274793131625

[29] ldquoQuartzrdquo 2017 [Online] Available httpshpcllnlgovhardwareplatformsQuartz

[30] ldquoReddit public datardquo 2016 [Online] Available httpsgithubcomdewarimreddit-data-tools

[31] ldquoImdb public datardquo 2016 [Online] Available httpwwwimdbcominterfaces

[32] ldquoGraph 500 benchmarkrdquo 2016 [Online] Available httpwwwgraph500org

[33] W-S Han J Lee and J-H Lee ldquoTurboiso Towards ultrafast and robustsubgraph isomorphism search in large graph databasesrdquo in Proceedingsof the 2013 ACM SIGMOD International Conference on Management ofData ser SIGMOD rsquo13 New York NY USA ACM 2013 pp 337ndash348 [Online] Available httpdoiacmorg10114524636762465300

[34] S Gurajada S Seufert I Miliaraki and M Theobald ldquoTriad Adistributed shared-nothing rdf engine based on asynchronous messagepassingrdquo in Proceedings of the 2014 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo14 NewYork NY USA ACM 2014 pp 289ndash300 [Online] Availablehttpdoiacmorg10114525885552610511

[35] ldquoLargetriplestoresrdquo [Online] Available httpswwww3orgwikiLargeTripleStores

[36] A Dave A Jindal L E Li R Xin J Gonzalez and M ZaharialdquoGraphframes An integrated api for mixing graph and relationalqueriesrdquo in Proceedings of the Fourth International Workshop onGraph Data Management Experiences and Systems ser GRADES rsquo16New York NY USA ACM 2016 pp 21ndash28 [Online] Availablehttpdoiacmorg10114529604142960416

[37] ldquoGraphframesrdquo [Online] Available httpgraphframesgithubio[38] T Plantenga ldquoInexact subgraph isomorphism in mapreducerdquo J Parallel

Distrib Comput vol 73 no 2 pp 164ndash175 Feb 2013 [Online]Available httpdxdoiorg101016jjpdc201210005

[39] Z Zhao G Wang A R Butt M Khan V S A Kumar and M VMarathe ldquoSahad Subgraph analysis in massive networks using hadooprdquoin 2012 IEEE 26th International Parallel and Distributed ProcessingSymposium May 2012 pp 390ndash401

[40] G M Slota and K Madduri ldquoComplex network analysis using parallelapproximate motif countingrdquo in Proc 28th IEEE Intrsquol Parallel andDistributed Processing Symposium (IPDPS) IEEE May 2014 pp405ndash414

[41] V T Chakaravarthy M Kapralov P Murali F Petrini X Que Y Sab-harwal and B Schieber ldquoSubgraph counting Color coding beyondtreesrdquo in 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS) May 2016 pp 2ndash11

[42] J Gao C Zhou J Zhou and J X Yu ldquoContinuous pattern detectionover billion-edge graph using distributed frameworkrdquo in 2014 IEEE 30thInternational Conference on Data Engineering March 2014 pp 556ndash567

[43] Z Sun H Wang H Wang B Shao and J Li ldquoEfficientsubgraph matching on billion node graphsrdquo Proc VLDB Endowvol 5 no 9 pp 788ndash799 May 2012 [Online] Availablehttpdxdoiorg101477823119062311907

[44] S Ma Y Cao J Huai and T Wo ldquoDistributed graph pattern matchingrdquoin Proceedings of the 21st International Conference on World WideWeb ser WWW rsquo12 New York NY USA ACM 2012 pp 949ndash958[Online] Available httpdoiacmorg10114521878362187963

[45] A Fard M U Nisar L Ramaswamy J A Miller and M SaltzldquoA distributed vertex-centric approach for pattern matching in massive

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 15: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

graphsrdquo in 2013 IEEE International Conference on Big Data Oct 2013pp 403ndash411

[46] G Malewicz M H Austern A J Bik J C Dehnert I HornN Leiser and G Czajkowski ldquoPregel A system for large-scale graphprocessingrdquo in Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of Data ser SIGMOD rsquo10 NewYork NY USA ACM 2010 pp 135ndash146 [Online] Availablehttpdoiacmorg10114518071671807184

[47] J E Gonzalez R S Xin A Dave D Crankshaw M JFranklin and I Stoica ldquoGraphx Graph processing in a distributeddataflow frameworkrdquo in Proceedings of the 11th USENIX Conferenceon Operating Systems Design and Implementation ser OSDIrsquo14Berkeley CA USA USENIX Association 2014 pp 599ndash613[Online] Available httpdlacmorgcitationcfmid=26850482685096

[48] N Sundaram N Satish M M A Patwary S R Dulloor M JAnderson S G Vadlamudi D Das and P Dubey ldquoGraphmatHigh performance graph analytics made productiverdquo Proc VLDBEndow vol 8 no 11 pp 1214ndash1225 Jul 2015 [Online] Availablehttpdxdoiorg101477828099742809983

[49] S Hong S Depner T Manhardt J Van Der Lugt M Verstraatenand H Chafi ldquoPgxd A fast distributed graph processing enginerdquoin Proceedings of the International Conference for High PerformanceComputing Networking Storage and Analysis ser SC rsquo15 NewYork NY USA ACM 2015 pp 581ndash5812 [Online] Availablehttpdoiacmorg10114528075912807620

[50] F Zhu Q Qu D Lo X Yan J Han and P Yu ldquoMining top-k largestructural patterns in a massive networkrdquo Proceedings of the VLDBEndowment vol 4 no 11 pp 807ndash818 8 2011

[51] H Shang Y Zhang X Lin and J X Yu ldquoTaming verificationhardness An efficient algorithm for testing subgraph isomorphismrdquoProc VLDB Endow vol 1 no 1 pp 364ndash375 Aug 2008 [Online]Available httpdxdoiorg101477814538561453899

[52] J Cheng J X Yu B Ding P S Yu and H Wang ldquoFast graphpattern matchingrdquo in 2008 IEEE 24th International Conference on DataEngineering April 2008 pp 913ndash922

[53] M R Henzinger T A Henzinger and P W Kopke ldquoComputingsimulations on finite and infinite graphsrdquo in Proceedings of the 36thAnnual Symposium on Foundations of Computer Science ser FOCSrsquo95 Washington DC USA IEEE Computer Society 1995 pp 453ndash[Online] Available httpdlacmorgcitationcfmid=795662796255

[54] G Liu K Zheng Y Wang M A Orgun A Liu L Zhao and X ZhouldquoMulti-constrained graph pattern matching in large-scale contextualsocial graphsrdquo in 2015 IEEE 31st International Conference on DataEngineering April 2015 pp 351ndash362

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description
Page 16: PruneJuice: Pruning Trillion-edge Graphs to a Precise ...matei/papers/sc18.pdf · PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Tahsin Reza 1; 2,

APPENDIX AARTIFACT DESCRIPTION

Here we present an example of searching a pattern in aR-MAT generated graph using our program The code isdeveloped on top of HavoqGT Detailed instructions on how touse our tool are available at httpsgithubcomLLNLhavoqgt

Clone the code from httpsgithubcomLLNLhavoqgtgitYou will require the latest releases of OpenMPI or

MAVPICH2 and the Boost library to run HavoqGT The codehas only been tested on latest generation of Linux distributionsOnce you have checked out the code make sure you are onthe master branch

Go to the directory buildquartz Setup CMake environmentby running the following script

scriptsquartzllnlgovdo cmakesh(Make necessary adjustments to the script for CMake to workwithin your environment)

The next step is to generate a graph in HavoqGT format Goto the directory buildquartz and build the R-MAT generator

make generate rmatCreate a directory eg usrgraph to store the generated

graphAssuming you are in a Slurm environment run the following

command to generate a R-MAT graph

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcgenerate rmat -s 18 -p 1 -f 1 -o devshmrmat -bursgraphrmat

This will create a graph with four partitions to be run usingfour MPI processes Note that this is a Scale 18 graph (Noticethe parameter for the -s flag) The mmapbinary graph file willbe store in usrgraph

Next we build the pattern matching executablemake run pattern matching betaWe will search the following Tree pattern on the graph we

just created The numeric values on each vertex is the labelof the respective vertex

2

4

3 7

5

3 7

Tree

We use degree information to create numeric vertex labelscomputed using the formula dlog2(d(vi)+1)e Here d(vi) isthe degree of a vertex vi

The input pattern is available at httpsgithubcomLLNLhavoqgt (See the instructions on the readme page regardinghow to download the sample input pattern)

See the instructions on the readme page regarding howcreate the output directory before you run the program(httpsgithubcomLLNLhavoqgt)

Once you have the input pattern eg usrpattern andoutput eg usrresults directories setup use the followingcommand to search the pattern stored in usrpattern

Note that we do not need to provide vertex labels for theTree pattern as we will use labels based on vertex degreeand the program will generate labels when no input label isprovided ie the -l flag is not set (The readme page has linksto other datasets that require input labels)

srun -N1 ndashntasks-per-node=4 ndashdistribution=blocksrcrun pattern matching beta -i devshmrmat -busrgraphrmat -p usrpattern -o usrresults

The program logs status information to the standard outputso you know the current state of the execution

See the instructions on the readme page regarding how tointerpret the results and retrieve the pruned graph You willfind scripts (written in python) that will help you to parse theresult files (httpsgithubcomLLNLhavoqgt)

Also instructions on how to enumerate a pattern on thepruned graph are available on the readme page (httpsgithubcomLLNLhavoqgt)

We encourage you to try the Tree pattern on a Scale 28 orlarger R-MAT graph and compare the results you obtain withthe numbers we have reported in Section V-A

The instructions page also explains how to partition a graph(from a given edge list) for distributed processing providerequired vertex labels and use the scripts to generate non-localconstraints for a given pattern

  • Introduction
  • Preliminaries
  • Graph Pruning for Scalable Matching
  • Asynchronous Distributed Algorithms
  • Evaluation
    • Weak Scaling Experiments
    • Strong Scaling Experiments
    • Match Enumeration
    • Example Use Case Social Network Analysis
    • Pruning PrecisionGuarantees vs Time-to-Solution
    • Impact of Design Decisions and Strategic Optimizations
    • Load Balancing
    • Comparison with Existing Work QFrag
      • Limitations Discussion
      • Related Work
      • Conclusion
      • References
      • Appendix A Artifact Description