
Page 1: Large-Scale Graph Mining - GREYC › projects › decade › ... · PERFORMANCE: YAGO 27 SIGMOD’18, June 2018, Houston, Texas USA double blind review 101 102 103 1044 10

Large-Scale Graph Mining

Vincent Leroy


FREQUENT SUBGRAPH MINING (FSM)

• Graphs represent complex data

• Chemical compounds, proteins

• Social networks

• Knowledge bases (ontologies)

• Frequent Subgraph Mining

• Discover regularities in the structure of a graph

• Properties and interactions (citations graph, organization structure)

• Privacy (social networks)

• Link prediction (recommender systems, linked data)

10M entities, 120M facts

570M entities, 18B facts (2012)


FSM: CHEMISTRY

Caffeine molecule: C8H10N4O2


FSM: KNOWLEDGE BASE

[Figure: example knowledge graph. Entities: Jim Halpert, Dwight Schrute, Don Draper, Peggy Olson, Sterling Cooper, Dunder Mifflin, Scranton, New York. Edge labels: works at (4 edges), lives in (3 edges), located in (2 edges)]


FSM: DEFINITION

• Find all frequent (support ≥ ε) subgraphs

[Figure: input graph (nodes JH, DS, DD, PO, SC, DM, S, NY) and pattern P1, which has 2 embeddings in the input graph]

Support(P1) = 2


SUPPORT DEFINITION

• Find all frequent (support ≥ ε) subgraphs

[Figure: pattern P2 and the input graph (nodes JH, DS, DD, PO, SC, DM, S, NY); P2 has 4 embeddings in the input graph]

• Minimum Image Support: min(#mappings)
• Anti-monotony

Support(P2) = 2
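The Minimum Image Support on this slide can be illustrated with a small brute-force sketch. It assumes an unlabeled, undirected graph (the slide's knowledge graph is labeled), and the helper names `embeddings` and `minimum_image_support` as well as the star-shaped toy graph are hypothetical, chosen only for illustration. It enumerates every embedding explicitly, so it is not how SAMi computes support.

```python
from itertools import permutations

def embeddings(pattern_edges, graph_edges):
    """Brute force: all injective vertex mappings that preserve every pattern edge."""
    g = {frozenset(e) for e in graph_edges}
    p_vertices = sorted({v for e in pattern_edges for v in e})
    g_vertices = sorted({v for e in graph_edges for v in e})
    found = []
    for image in permutations(g_vertices, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if all(frozenset((m[a], m[b])) in g for a, b in pattern_edges):
            found.append(image)
    return found

def minimum_image_support(pattern_edges, graph_edges):
    """min over pattern vertices of the number of distinct graph vertices they map to."""
    embs = embeddings(pattern_edges, graph_edges)
    if not embs:
        return 0
    return min(len({e[i] for e in embs}) for i in range(len(embs[0])))

# Hypothetical toy graph: a star with center X and leaves A, B, C.
star = [("X", "A"), ("X", "B"), ("X", "C")]
edge = [(1, 2)]            # single-edge pattern
path = [(1, 2), (2, 3)]    # two-edge path, obtained by extending the edge

path_embeddings = embeddings(path, star)
edge_support = minimum_image_support(edge, star)
path_support = minimum_image_support(path, star)
```

On this toy graph the path has several embeddings, yet its middle vertex can only map to X, so the minimum over vertex images is what bounds the support. The support of the extended pattern does not exceed that of the single edge, which is the anti-monotony property in this instance.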


CHALLENGE

• Computing the support requires keeping track of embeddings

[Figure: input graph on vertices A, B, C, D, E, X; the identified pattern P has 60 embeddings, listed as ABXC, ABXD, ABXE, ACXB, ACXD, …]

• Up to factorial(V) embeddings for a single pattern due to symmetry (ex: 10! > 3M)
• Mining larger patterns and dealing with high-degree vertices is costly
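The count of 60 embeddings can be reproduced with a short sketch, under an assumption that is not stated on the slide: the pattern is a star whose center maps to X and whose three leaves map to distinct neighbors among A to E. The tuple layout (leaf, leaf, X, leaf) mirrors the columns listed in the figure.

```python
from itertools import permutations
from math import factorial

neighbors = ["A", "B", "C", "D", "E"]

# Assumed pattern: center X with three leaves; each ordered choice of
# three distinct neighbors is a separate embedding.
star_embeddings = [(a, b, "X", c) for a, b, c in permutations(neighbors, 3)]
count = len(star_embeddings)      # 5 * 4 * 3 ordered choices

# Worst case quoted on the slide: factorial(V) embeddings for V = 10.
worst_case = factorial(10)
```

Every embedding covers the same kind of subgraph, but symmetry forces each ordering to be stored separately when embeddings are kept as flat tuples.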


STATE OF THE ART

• Arabesque [SOSP 2015]: use more resources, with parallel and distributed computation

• ScaleMine [SC 2016]: simplify the problem by computing a minimal set of embeddings to reach the support threshold ε

  • Loses accurate information on support, which is important for many applications


CONTRIBUTIONS

• Address the core algorithmic and data structure problem of FSM with a new algorithm: SAMi

• Define 5 primitive operations to recursively enumerate patterns

• Propose a compressed representation of embeddings that circumvents the cost of enumerating embeddings


OVERVIEW OF SAMi

[Diagram: SAMi pipeline. Labels: Init. (patterns of 1 edge); patterns sharing s-1 edges; parent patterns (P1, P2); 5 primitives; child pattern C of s edges; canonicality test; support test]


EMBEDDINGS REPRESENTATION


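INTUITION

[Figure: pattern P and input graph G on vertices A, B, C, D, E, X; embeddings listed as ABX, ACX, ADX, AEX, BAX, …]

There must be a more compact way to express this:

One of {A, B, C, D, E}, then one of {A, B, C, D, E}*, then X

*No duplicates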


AUTOMATON REPRESENTATION OF EMBEDDINGS

[Figure: automaton with states s, q1, q2, f. Transitions s → q1 labeled A, B, C, D, E; q1 → q2 labeled A, B, C, D, E; q2 → f labeled X. *No duplicates]

• Deterministic finite automaton

• Alphabet: all vertex identifiers from the input graph

• Words accepted: embeddings of the pattern*

We save memory, can we do more?
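The memory saving can be made concrete with a minimal sketch. It models the automaton of this slide as a chain with one set of transition labels per level (s → q1 → q2 → f), which is a simplification: a general automaton can branch into several states per level. The function name `accepted_words` is hypothetical.

```python
def accepted_words(levels):
    """Enumerate the words of a chain automaton under the no-duplicates rule."""
    words = [()]
    for labels in levels:
        # Extend each partial word with every label not already used in it.
        words = [w + (x,) for w in words for x in labels if x not in w]
    return words

# One list of transition labels per level, as drawn on the slide.
levels = [
    ["A", "B", "C", "D", "E"],   # s  -> q1
    ["A", "B", "C", "D", "E"],   # q1 -> q2
    ["X"],                       # q2 -> f
]

words = accepted_words(levels)
n_transitions = sum(len(labels) for labels in levels)
```

The automaton stores a handful of transitions while standing for a larger set of embeddings, and the gap widens as patterns grow. Enumerating the words, as done here for checking, is precisely the work the compressed representation is meant to avoid.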


PATTERN GENERATION


PATTERN REPRESENTATION

• Patterns generated recursively by adding edges

• Graph structure represented using DFS codes (gSpan, 2002)

  • Different codes can describe the same graph

  • Examples on unlabeled undirected graphs, but generalizable

[Figure: the same 4-vertex graph (a triangle plus one pendant vertex) drawn under two vertex numberings. Labels: Forward edge, Backward edge, Canonical Representation]

(1,2), (2,3), (3,1), (2,4)

(1,2), (2,3), (3,1), (3,4)
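That different codes can describe the same graph can be checked with a brute-force sketch on the two codes of this slide. It treats each code as a plain list of undirected edges and ignores the forward/backward ordering rules of DFS codes; the function names are hypothetical, and the check is only practical for very small patterns.

```python
from itertools import permutations

def edge_set(code):
    """Undirected edge set described by a list of (i, j) pairs."""
    return {frozenset(e) for e in code}

def same_graph(code_a, code_b):
    """True if some renumbering of vertices maps the edges of code_a onto code_b."""
    ea, eb = edge_set(code_a), edge_set(code_b)
    va = sorted({v for e in code_a for v in e})
    vb = sorted({v for e in code_b for v in e})
    if len(va) != len(vb) or len(ea) != len(eb):
        return False
    for image in permutations(vb):
        m = dict(zip(va, image))
        if {frozenset((m[a], m[b])) for a, b in code_a} == eb:
            return True
    return False

code_1 = [(1, 2), (2, 3), (3, 1), (2, 4)]
code_2 = [(1, 2), (2, 3), (3, 1), (3, 4)]
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]   # a 4-cycle, for contrast

codes_match = same_graph(code_1, code_2)
cycle_matches = same_graph(code_1, cycle)
```

Because several codes describe one graph, a miner needs a canonical representative to avoid reporting the same pattern more than once.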


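RECURSIVE GENERATION

• Canonical child pattern of s edges obtained from 2 parents of s-1 edges

P1 = e_1 … e_{s−2}, e_{s−1}

P2 = e_1 … e_{s−2}, e_s'

C = e_1 … e_{s−2}, e_{s−1}, e_s

[Figure: example parent patterns P1 and P2 and child pattern C, drawn on vertices numbered 1 to 4]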


GENERATION PRIMITIVES

                 e_{s−1} Backward   e_{s−1} Forward
e_s Backward     BB-merge           FB-merge
e_s Forward      BF-merge           FF-merge, Extension

Completeness: no canonical frequent pattern is missed


FB-MERGE

P1 = (1,2), (2,3)

P2 = (1,2), (1,3)

C = (1,2), (2,3), (3,1), with e_s = swap(e_s')

[Figure: P1, P2 and C drawn on vertices 1, 2, 3]

(v1,v2,v3) is an embedding of both P1 and P2, so (v1,v2,v3) is an embedding of C

Intersection of embeddings
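The intersection property can be checked on explicit embedding sets with a small sketch. The input graph below is hypothetical (a triangle A, B, C with an extra vertex D attached to C), the graph is treated as unlabeled and undirected, and `embeddings` is an illustrative brute-force helper rather than part of SAMi.

```python
from itertools import permutations

def embeddings(code, graph_edges):
    """All injective mappings of pattern vertices 1..n that preserve the code's edges."""
    g = {frozenset(e) for e in graph_edges}
    n = max(v for e in code for v in e)
    vertices = sorted({v for e in graph_edges for v in e})
    return {
        img for img in permutations(vertices, n)
        if all(frozenset((img[a - 1], img[b - 1])) in g for a, b in code)
    }

# Hypothetical input graph: triangle A-B-C plus a pendant vertex D on C.
graph = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

P1 = [(1, 2), (2, 3)]
P2 = [(1, 2), (1, 3)]
C = [(1, 2), (2, 3), (3, 1)]

# FB-merge on explicit sets: keep the tuples that embed both parents.
merged = embeddings(P1, graph) & embeddings(P2, graph)
direct = embeddings(C, graph)
```

The two sets coincide because the edges of C are exactly the union of the edges of P1 and P2 under the same vertex numbering. With explicit sets the cost grows with the number of embeddings, which is the motivation for performing the same operation on automata.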


FB-MERGE ON AUTOMATA OF EMBEDDINGS

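• FB-merge: intersections of embeddings

• Generate an automaton that accepts W_P1 ∩ W_P2: product of automata, O(#states²)

[Figure: automata of P1 and P2 (states s, q1, q2, q3, f; transitions labeled A, B, C, D) and their product automaton (states s, q1,2, q1,3, f; transitions labeled A, B, C)]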
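The product construction can be sketched as follows. The representation (a start state, a set of final states, and a transition dictionary) and the two small automata are assumptions made for illustration; they are not the automata drawn on the slide. The automata are assumed acyclic, as is the case when every accepted word has the length of the pattern.

```python
def product(dfa1, dfa2):
    """Product automaton accepting the intersection of two DFAs.

    Each DFA is (start, finals, transitions), where transitions maps
    state -> {label: next_state}.
    """
    (s1, f1, t1), (s2, f2, t2) = dfa1, dfa2
    start = (s1, s2)
    trans, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        q = stack.pop()
        a, b = q
        if a in f1 and b in f2:
            finals.add(q)
        trans[q] = {}
        for label, na in t1.get(a, {}).items():
            nb = t2.get(b, {}).get(label)
            if nb is not None:            # label must be readable in both automata
                nxt = (na, nb)
                trans[q][label] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return start, finals, trans

def words(dfa):
    """All accepted words of an acyclic DFA (used here only to check the result)."""
    start, finals, trans = dfa
    out = set()
    def walk(q, prefix):
        if q in finals:
            out.add(prefix)
        for label, nxt in trans.get(q, {}).items():
            walk(nxt, prefix + (label,))
    walk(start, ())
    return out

# Hypothetical automata: dfa1 accepts ABC, ABD, BAC; dfa2 accepts ABC, ACB.
dfa1 = ("s", {"f"}, {"s": {"A": "q1", "B": "q2"}, "q1": {"B": "q3"},
                     "q2": {"A": "q4"}, "q3": {"C": "f", "D": "f"}, "q4": {"C": "f"}})
dfa2 = ("s", {"f"}, {"s": {"A": "p1"}, "p1": {"B": "p2", "C": "p3"},
                     "p2": {"C": "f"}, "p3": {"B": "f"}})

merged = product(dfa1, dfa2)
merged_words = words(merged)
```

Only reachable pairs of states are created, and their number is bounded by the product of the two state counts, which matches the O(#states²) bound on the slide.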


AUTOMATA VALIDATION

• Support computed directly from automata

• Mappings of vertex i are the labels of transitions at level i

• No duplicates rule, check that each transition has at least a valid path

[Figure: product automaton with states s, q1,2, q1,3, f and transitions labeled A, B, C]
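The validation step can be illustrated with a naive sketch: a transition label counts as a mapping of vertex i only if it lies on at least one accepting path that uses no label twice. The functions below are hypothetical and walk every path explicitly, which is more work than an efficient implementation would do; they assume an acyclic automaton whose accepted words all have the same length, and the example automaton is invented for illustration.

```python
def valid_labels_per_level(start, finals, trans):
    """For each level i, the labels found on some duplicate-free accepting path."""
    levels = {}
    def walk(q, prefix):
        if q in finals:
            for i, label in enumerate(prefix):
                levels.setdefault(i, set()).add(label)
        for label, nxt in trans.get(q, {}).items():
            if label not in prefix:       # no-duplicates rule
                walk(nxt, prefix + (label,))
    walk(start, ())
    return [levels[i] for i in sorted(levels)]

def support(start, finals, trans):
    """Minimum, over levels, of the number of valid labels at that level."""
    per_level = valid_labels_per_level(start, finals, trans)
    return min((len(labels) for labels in per_level), default=0)

# Hypothetical automaton: s -A-> q, then q -A-> f or q -B-> f.
trans = {"s": {"A": "q"}, "q": {"A": "f", "B": "f"}}

per_level = valid_labels_per_level("s", {"f"}, trans)
automaton_support = support("s", {"f"}, trans)
```

Counting raw transition labels would report two mappings at the second level, but the transition labeled A there has no valid path because A is already used at the first level, so only B is kept.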


PRIMITIVES ON AUTOMATA: GENERALIZATION

• Each of the 5 primitives can be performed directly on automata

• O(#embeddings) becomes O(#automaton states²)

• Compact automata lead to huge gains

• Minimization: Revuz’s algorithm


SAMi is complete: all frequent patterns are generated in their canonical representation


EXPERIMENTS


SETUP

• Datasets

• Citeseer: 3k vertices, 5k edges

• Patents: 2M vertices, 13M edges

• Yago: 2M vertices, 4M edges

• Parameters

• Pattern complexity (#edges)

• Support threshold (ε)

• Measures

• Mining time

• #embeddings / automata size


PERFORMANCE: CITESEER

[Figure: mining time on Citeseer. x-axis: maximum number of edges (3 to 10); y-axis: time [sec], log scale from 10^0 to 10^4; series: SAMi, ScaleMine, Arabesque]


EMBEDDINGS COMPRESSION

[Figure: number of embeddings, transitions and states as a function of the number of vertices in the pattern (2 to 8); y-axis: count, log scale from 10^2 to 10^10]


PERFORMANCE: PATENTS

[Figure: mining time on Patents. x-axis: support threshold ε (16k to 28k); y-axis: time [sec], log scale from 10^1 to 10^5; series: SAMi, ScaleMine, Arabesque; annotation: 5h]


PERFORMANCE: YAGO


SIGMOD’18, June 2018, Houston, Texas USA double blind review

[Figure 10: Mining time for Patents. x-axis: support threshold ε (16k to 28k); y-axis: time [sec], log scale from 10^1 to 10^5; series: SAMi, ScaleMine, Arabesque]

Max. #edges =   3         4         5
ε = 2           0:02:47   1:14:57   3:44:11
ε = 10          0:02:28   1:14:49   2:52:21
ε = 100         0:02:26   1:14:26   2:35:28
ε = 1000        0:02:07   1:11:19   2:13:05

Table 2: Mining time for Yago2s (H:MM:SS)

of vertices increases by 1, the number of embeddings increases by a factor of 12 to 15. This is consistent with the increase in execution time of Arabesque reported in Figure 8. The execution cost of SAMi, however, depends on the size of automata, which increases at a much lower rate (between 2 and 3). This is also consistent with SAMi’s execution times on this dataset. This experiment shows that SAMi’s automaton-based representation of embeddings achieves its goal of preserving a concise data structure throughout the recursive pattern generation process.

Varying the support threshold. In Figure 10, we switch to the Patents dataset and vary the support threshold, without setting a limitation on the number of edges in patterns. Arabesque and ScaleMine have similar performance for ε=28k. This is because, for that threshold, frequent patterns contain at most 3 vertices. However, at ε=24k, patterns with 4 vertices are found, and Arabesque’s execution time increases by a factor of 20. At ε=15k, patterns contain up to 11 vertices. Nonetheless, the combinatorial explosion in the number of embeddings does not affect SAMi, as the compressed representation avoids their enumeration. SAMi is even between 5 (ε=24k) and 22 (ε=15k) times faster than ScaleMine in this experiment, despite the additional requirement of computing exact supports. With higher support thresholds and a larger graph, ScaleMine has to find more embeddings than in the case of Citeseer, which increases its execution time. Comparatively, SAMi’s primitives on automata are more efficient.

6.4 Detailed evaluation of SAMi

We now focus on SAMi, and evaluate its performance on a specific type of data: knowledge graphs. Then, we detail the execution of SAMi to better understand the usage of primitives. Finally, we evaluate the scalability of SAMi in the cluster configuration.

Performance on knowledge graphs. Knowledge graphs, such as Yago, are particularly challenging because they contain star-shaped subgraphs. AMIE+ [14] and OP [7] specialize in finding graph-based association rules in these datasets. They include specific optimizations to avoid computing patterns that are unlikely to lead to interesting rules. However, they suffer from the same scalability issue as Arabesque, since they enumerate all embeddings. In practice, OP [7] produces rules of 3 vertices, and AMIE+ [14] produces rules of 3 vertices on Yago2s and 4 on Yago2Core. Table 2 and Figure 11 indicate the running times of SAMi for these datasets. In the case of Yago2s, SAMi finds all patterns of size 3 in less than 3 minutes, even with extremely low support thresholds (ε=2, giving 7,772 patterns). For patterns of size 5, SAMi takes up to 3 hours 45 minutes, which is still reasonable. This generates over a million patterns at the lowest support threshold. While it is unlikely that an analyst would be interested in the full list of results, these patterns can be further refined into association rules, and filtered based on their accuracy. For Yago2Core, SAMi reaches from 6 to 8 edges depending on the support threshold. Overall, this experiment shows that SAMi can accelerate association rule mining algorithms, with the potential to find more accurate association rules than existing approaches by including more edges in patterns.

[Figure 11: Mining time for Yago2Core. x-axis: maximum number of edges (4 to 8); y-axis: time [sec], log scale from 10^2 to 10^5; series: ε=2, ε=10, ε=100, ε=1000]

Execution analysis. We analyze the execution of SAMi on Patents and Yago2Core, and present the number of invocations of each primitive in Tables 3 and 4. In both cases, EE-merge and EE-extension are frequently used, while EI-merge is frequent for Yago2Core but not for Patents. As explained in Section 3.2, SAMi sometimes computes the embeddings corresponding to some non-canonical codes ending with an external edge. This is overhead, as SAMi also computes the embeddings corresponding to the canonical version of the same pattern. However, this computation only takes place if the embeddings are actually used to reach a canonical result. The direct column reports the number of invocations of a primitive to compute the support of a pattern represented by its canonical code. The lazy column indicates how many non-canonical DFS codes are considered throughout the enumeration, and the lazy computed column indicates how many times the embeddings of a non-canonical code are actually computed. The analysis of the execution clearly shows that, in practice, this is very rare, so the overhead is extremely limited. In the case of Patents, the support of 7,333 canonical patterns is computed, out of which 1,166 are frequent (16%). This is because SAMi has to reach non-frequent patterns to stop the recursion using the anti-monotony property of MI support. In the case of Yago2Core, this ratio is higher, at 47%, because the recursion also stops on a frequent pattern when reaching 5 edges. Overall, SAMi

Ontological Pathfinding [SIGMOD16]

AMIE+ [VLDBJ]

Max #edges=3


GRAPH MINING: CONCLUSION

• Addresses the fundamental problems of FSM

• Pattern generation process (5 primitives)

• Compressed representation of embeddings

• Three orders of magnitude faster than state of the art

• Opens new possibilities: knowledge graph mining

• Qualitative evaluation of mining outcome