Memoirs of a Graph Addict:
Despair to Redemption
Marko A. Rodriguez
Graph Systems Architect
http://markorodriguez.com
http://twitter.com/twarko
Winter Whirlwind Tour – Chicago to Malmo – January 10-14, 2011
January 8, 2011
Abstract
A graph database provides a means of linking together objects using direct references. In other words, in order to determine if one object is adjacent to another, no index lookup is required. In contrast to relational databases, in a graph database, there is no notion of a join operation as the graph is already an explicitly joined structure. Given a graph, problems are solved using graph traversals, that is, directed walks over the objects and relations that compose the graph. This lecture has three primary points of discussion. The first is a description of graph database technology. The second, a memoir of the speaker’s applied and theoretical work with graphs. The third and final point, a review of an open source graph processing stack currently being developed by AT&T Interactive and its collaborators.
For 10 years now, I’ve dealt with a painful graph addiction...Let me share my story with you.
Outline
• Graph Structures
• Graph Databases
• Graph Applications
• TinkerPop Product Suite
Graph Data Structure Pieces: Part 1
• element: anything in the graph that carries an id; an element is either a
  ◦ vertex (thing, object, dot), or an
  ◦ edge (relation, join, line)
Single-Relational Graph
[Figure: a single-relational graph over the vertices marko, peter, tinkerpop, neotech, neo4j, blueprints, and gremlin; the edges are unlabeled.]
In single-relational graphs, things are related. Unfortunately, this is not a very useful
structure for most domain modeling situations. Relatedness is too generic: all edges have
the same meaning.
Graph Data Structure Pieces: Part 2
• element: carries an id; an element is either a
  ◦ vertex (thing, object, dot), or an
  ◦ edge (relation, join, line), which now also carries a label
Multi-Relational Graph
[Figure: the same vertices (marko, peter, tinkerpop, neotech, neo4j, blueprints, gremlin), now connected by edges labeled knows, member, created, and imports.]
By adding labels to the edges, it's possible to denote the type of relation that exists
between any two vertices. Now it's possible to denote different types of things and the
different ways in which they relate to one another.
Graph Data Structure Pieces: Part 3
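• element: carries an id and a property map (key1=value1, key2=value2); an element is either a
  ◦ vertex (thing, object, dot), or an
  ◦ edge (relation, join, line), which also carries a label
• property (key/value, attribute): a single key=value entry in an element's property map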
Property Graph
[Figure: the multi-relational graph above (vertices marko, peter, tinkerpop, neotech, neo4j, blueprints, gremlin; edge labels knows, member, created, imports) with properties attached. Property maps shown: lang=java use=traverse; lang=java use=api; lang=java use=graphdb; and date=2009 (twice).]
A property graph allows elements to have key/value properties. In particular, properties are
very useful for further specifying the meaning of an edge. “When did TinkerPop create Gremlin?”
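A minimal Python sketch of the three pieces (id, label, property map) may make the structure concrete. The class and function names are hypothetical, and attaching date=2009 to the tinkerpop-created-gremlin edge is an assumption read off the figure, not something the slides state outright.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # direct references to outgoing edges


@dataclass(eq=False)
class Edge:
    label: str
    out_v: Vertex  # tail of the edge
    in_v: Vertex   # head of the edge
    properties: dict = field(default_factory=dict)


def add_edge(out_v, label, in_v, **properties):
    """Create a labeled edge and register it on its tail vertex."""
    edge = Edge(label, out_v, in_v, properties)
    out_v.out_edges.append(edge)
    return edge


# Illustrative data; the property placement is assumed, not taken from the source.
tinkerpop = Vertex("tinkerpop")
gremlin = Vertex("gremlin", {"lang": "java", "use": "traverse"})
add_edge(tinkerpop, "created", gremlin, date=2009)


def created_dates(creator, product):
    """Answer 'when did creator create product?' from the edge properties."""
    return [e.properties.get("date") for e in creator.out_edges
            if e.label == "created" and e.in_v is product]


print(created_dates(tinkerpop, gremlin))
```

The question is answered by reading a property off the edge itself, which is the point of the property graph model.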
Numerous Graph Types
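[Figure: a word cloud of graph types with small example annotations.]
• Graph types shown: multi, weighted, directed, undirected, simple, pseudo, hyper, half-edge, semantic, resource description framework, vertex-labeled, vertex-attributed, edge-labeled, edge-attributed.
• Example annotations shown: http://ex.com/123, a, 0.2, knows, hired, name=emil type=person, created=2-01-09 modified=2-11-09.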
Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science
and Technology, 36(6), pp. 35-41, 2010. [http://arxiv.org/abs/1006.2361]
Property Graph as a Rich Structure
[Figure: a diagram with the property graph at its root and, as its other nodes, the weighted graph, semantic graph, multi-graph, labeled graph, rdf graph, directed graph, undirected graph, and simple graph. The arrows between graph types carry these operations: add weight attribute; remove attributes; remove edge labels; make labels URIs; remove directionality; remove loops, directionality, and multiple edges; no op.]
A fun related thought: Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of
Applied Mathematics and Computer Sciences, 4(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]
Graph Algorithms in Single-Relational Graphs
• Most graph algorithms are designed for single-relational graphs. [1]
  ◦ Geodesic: shortest path, eccentricity, diameter, closeness centrality, betweenness centrality, etc.
  ◦ Eigenvector: spreading activation, pagerank, eigenvector centrality, etc.
  ◦ Assortative: scalar, assortative, etc.
[1] Excellent book reviewing numerous graph algorithms: Brandes U., Erlebach, T., “Network Analysis: Methodological Foundations,” Springer, 2005.
Graph Algorithms in Multi-Relational+ Graphs
• Most real-world software systems require multi-relational+ graphs. E.g.:
Who are the most central coauthors when all I know is wrote?
[Figure: authors connected to articles by wrote edges, with coauthor edges drawn between authors who wrote the same article.]
• A key concept when evaluating graph algorithms over multi-relational+ graphs is implicit adjacency/path descriptions/virtual edges/etc. [2]
[2] Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]
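The idea of an implicit edge can be sketched in Python: a coauthor relation is derived from paths of the form author, wrote, article, wrote (reversed), author, and a single-relational measure is then applied to the derived graph. The data and names below are hypothetical, and degree is used only as a stand-in for whichever centrality measure is of interest.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (author, article) pairs; all we know is "wrote".
wrote = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")]


def derive_coauthor(wrote_edges):
    """Build the implicit, undirected coauthor graph from wrote edges."""
    authors_of = defaultdict(set)
    for author, article in wrote_edges:
        authors_of[article].add(author)
    coauthor = defaultdict(set)
    for authors in authors_of.values():
        for a, b in combinations(sorted(authors), 2):
            coauthor[a].add(b)
            coauthor[b].add(a)
    return coauthor


coauthor = derive_coauthor(wrote)
# Degree centrality over the derived single-relational graph.
degree = {author: len(neighbors) for author, neighbors in coauthor.items()}
print(max(degree, key=degree.get))
```

Once the coauthor graph exists, any single-relational algorithm can be run on it unchanged.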
The Simplicity of a Graph
• A graph is a simple data structure.
• A graph states that something is related to something else (the foundation of any other data structure). [3]
• It is possible to model a graph in various types of databases. [4]
  ◦ Relational database: MySQL, Oracle, PostgreSQL
  ◦ JSON document database: MongoDB, CouchDB
  ◦ XML document database: MarkLogic, eXist-db
  ◦ etc.
[3] A graph can be used to represent other data structures. This point becomes convenient when looking beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc.
[4] For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed graph. Note that it is possible to model multi-relational graphs in these types of database as well.
Representing a Graph in a Relational Database
outV | inV
------------
A | B
A | C
C | D
D | A
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
Representing a Graph in a JSON Database
{
  "A" : {
    "outE" : ["B", "C"]
  },
  "B" : {
    "outE" : []
  },
  "C" : {
    "outE" : ["D"]
  },
  "D" : {
    "outE" : ["A"]
  }
}
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
Representing a Graph in an XML Database
<graphml>
  <graph>
    <node id="A" />
    <node id="B" />
    <node id="C" />
    <node id="D" />
    <edge source="A" target="B" />
    <edge source="A" target="C" />
    <edge source="C" target="D" />
    <edge source="D" target="A" />
  </graph>
</graphml>
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
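As a quick check that the three representations describe the same graph, here is a small Python sketch using only the standard library. The document strings restate the examples with quoted keys and attributes so that they parse; the helper names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Relational form: one (outV, inV) row per edge.
edge_table = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")]

# JSON document form: each vertex lists its outgoing neighbors.
json_doc = ('{"A": {"outE": ["B", "C"]}, "B": {"outE": []}, '
            '"C": {"outE": ["D"]}, "D": {"outE": ["A"]}}')

# XML (GraphML-like) form: nodes and edges as elements.
xml_doc = """<graphml><graph>
<node id="A"/><node id="B"/><node id="C"/><node id="D"/>
<edge source="A" target="B"/><edge source="A" target="C"/>
<edge source="C" target="D"/><edge source="D" target="A"/>
</graph></graphml>"""


def edges_from_json(text):
    doc = json.loads(text)
    return {(v, w) for v, record in doc.items() for w in record["outE"]}


def edges_from_xml(text):
    root = ET.fromstring(text)
    return {(e.get("source"), e.get("target")) for e in root.iter("edge")}


print(set(edge_table) == edges_from_json(json_doc) == edges_from_xml(xml_doc))
```

All three store the same edge set; what differs is how adjacency has to be looked up, which is the subject of the next slides.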
Defining a Graph Database
“If any database can represent a graph, then what is a graph database?”
A graph database is any storage system that provides index-free adjacency.
Defining a Graph Database by Example
[Figure: a toy graph with vertices A, B, C, D, and E, and a gremlin (the stuntman) who will walk it.]
Graph Databases and Index-Free Adjacency
[Figure: the toy graph with the gremlin standing on vertex A.]
• Our gremlin is at vertex A.
• In a graph database, vertex A has direct references to its adjacent vertices.
• The cost to move from A to B and C is constant with respect to the size of the graph. It
depends only upon the number of edges emanating from vertex A (local).
[Figure: the toy graph again, captioned “The Graph (explicit)”.]
Non-Graph Databases and Index-Based Adjacency
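[Figure: the toy graph next to an adjacency index keyed by vertex (A, B, C, D, E); each entry lists the adjacent vertices, for example A: B,C.]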
• Our gremlin is at vertex A.
• In a non-graph database, the gremlin needs to look at an index to determine what is adjacent to A.
• The cost to move to B and C is O(log n). It depends upon the total number of
vertices and edges in the database (global).
[Figure: the same index and toy graph, captioned “The Index (explicit)” and “The Graph (implicit)”.]
Index-Free Adjacency
• While any database can implicitly represent a graph, only a graph database makes the graph structure explicit. [5]
• In a graph database, each vertex serves as a “mini index” of its adjacent elements. [6]
• Thus, as the graph grows in size, the cost of a local step remains the same. [7]
[5] Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a relational database (MySQL) and a graph database (Neo4j).
[6] Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent elements. In this sense, traversing a graph is analogous in many ways to traversing an index, albeit the graph is not an acyclic connected graph (tree). (a vision espoused by Craig Taverner)
[7] A graph, in many ways, is like a distributed index.
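The contrast between the two adjacency styles can be sketched in Python. This is only an illustration under assumed names: a sorted edge table searched with binary search stands in for a database index, and plain object references stand in for a graph database's direct pointers.

```python
from bisect import bisect_left


class Vertex:
    """Index-free adjacency: a vertex holds direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.out = []


a, b, c = Vertex("A"), Vertex("B"), Vertex("C")
a.out = [b, c]


def neighbors_direct(vertex):
    # Cost depends only on the number of edges leaving this vertex (local).
    return [v.name for v in vertex.out]


# Index-based adjacency: a sorted (outV, inV) table acting as a global index.
edge_table = sorted([("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")])


def neighbors_indexed(name):
    # The binary search costs O(log n) in the size of the whole table (global).
    i = bisect_left(edge_table, (name,))
    result = []
    while i < len(edge_table) and edge_table[i][0] == name:
        result.append(edge_table[i][1])
        i += 1
    return result


print(neighbors_direct(a), neighbors_indexed("A"))
```

Both calls return the same neighbors; only the second one has a cost that grows as unrelated edges are added to the table.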
Graph Query = Graph Traversal
• Graph databases are optimized for graph-theoretic operations
(e.g. graph traversals).
• Graph databases are not optimized for set-theoretic
operations (e.g. union, intersection, theta-join).
• The graph traversal pattern: [8]
  ◦ Given some root set of elements, traverse in X fashion to yield some side-effect and/or destination.
[8] Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, 2011. http://arxiv.org/abs/1004.1001
Adventures in Graphlandia
My graph disease first started in 2001 and it’s only progressed since...
• Collective decision making: graph-based voting.
• Eudaemonic engine: graph-based recommendation.
• Universal computer: graph-based computing.
Collective Decision Making: Fall of the Modern World
The year is 2014.
Oil production has dropped significantly. Any reserves that are left are too expensive to purchase. Nations cannot transport food. [9]
Regions with poor agriculture yield famine.
[9] Peak oil available at http://en.wikipedia.org/wiki/Peak_oil.
People are in shock, fear, and panic over the fall of the modern world.
The world sees a 75% drop in human population.
The technology and knowledge of the modern world still exist.
The social infrastructure doesn’t... A few rise to create a new world order. [10]
[10] Watkins, J.H., M.A. Rodriguez, “A Survey of Web-Based Collective Decision Making Systems,” Studies in Computational Intelligence: Evolution of the Web in Artificial Intelligence Environments, eds. R. Nayak, N. Ichalkaranje, and L.C. Jain, pp. 245-279, 2008. [http://escholarship.org/uc/item/04h3h1cr]
Collective Decision Making: Rise of the Machines
Four strong, brave men begin the journey to stability. Decisions need to be made regarding how to determine and execute social goals. The distributed collective of TinkerPop is created.
• Marko Rodriguez (former USA)
• Peter Neubauer (former Sweden)
• Josh Shinavier (former China)
• Pavel Yaskevich (former Belarus)
[Figure: the four members (marko, josh, pavel, peter) drawn as vertices.]
Collective Decision Making: Rise of the Machines
[Figure: the four members (marko, josh, pavel, peter) shown twice, once under “Direct Democracy” and once under “Dynamically Distributed Democracy”.]
Two examples will be presented for the same decision making scenario. One using direct
democracy as the aggregation algorithm and one using dynamically distributed
democracy as the aggregation algorithm. [11]
[11] Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929]
Collective Decision Making: Direct Democracy
• “What percentage of our crop yield should we store as reserves?”
• The outcome is represented as a real value in [0, 1].
• Each individual has their opinion of the situation.
  ◦ Marko (80% should be stored.)
  ◦ Peter (50% should be stored.)
  ◦ Josh (80% should be stored.)
  ◦ Pavel (90% should be stored.)
[Figure: the four vertices annotated with their opinions: marko 0.8, josh 0.8, pavel 0.9, peter 0.5.]
Collective Decision Making: Direct Democracy
• In a direct democracy, everyone voices their opinion.
• The average of all voiced opinions is the final decision (even in binary decisions).
• For our society of 4, a pure direct democracy would yield (0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75.
[Figure: the four vertices annotated with their opinions: marko 0.8, josh 0.8, pavel 0.9, peter 0.5.]
Collective Decision Making: Direct Democracy
• If an individual abstains from participation, then their opinion is not considered.
• Assume only Peter and Pavel are there to participate. Marko and Josh are out hunting.
• For our society of 4 (with 2 voters), a pure direct democracy would yield (0.5 + 0.9)/2 = 0.7, an error of |0.75 − 0.7| = 0.05.
[Figure: the four vertices annotated with their opinions: marko 0.8, josh 0.8, pavel 0.9, peter 0.5.]
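The arithmetic on these direct democracy slides can be reproduced with a few lines of Python. The function name is hypothetical; the opinion values are the ones given above.

```python
opinions = {"marko": 0.8, "peter": 0.5, "josh": 0.8, "pavel": 0.9}


def direct_democracy(opinions, participants):
    """Average the opinions of those who actually voice them."""
    voiced = [opinions[p] for p in participants]
    return sum(voiced) / len(voiced)


full = direct_democracy(opinions, opinions)               # everyone participates
partial = direct_democracy(opinions, ["peter", "pavel"])  # Marko and Josh abstain
error = abs(full - partial)
print(round(full, 2), round(partial, 2), round(error, 2))
```

The error term is simply the distance between the partial-participation result and the result with full participation.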
Collective Decision Making: Representative Democracy
• Thomas Paine stated that when populations are small “some convenient tree will afford them a State house”, but as the population increases it becomes a necessity for representatives to “act in the same manner as the whole body would act were they present.” [12] [13]
[12] Paine, T., “Common Sense,” 1776.
[13] The role of the representative as an expert vs. a model is argued at length in Pitkin, H.F., “The Concept of Representation,” University of California Press, 1972.
Collective Decision Making: DDD
• Dynamically distributed democracy (DDD) strikes a balance between direct and representative democracy.
• An individual is at least a representative of themselves.
• An individual can also wield the power of those that abstain from participation.
• Dynamically distributing representative power is the purpose of the algorithm.
Collective Decision Making: DDD
• Peter believes that Josh and Marko are good decision makers.
• When Peter abstains, Marko and Josh wield his social power in equal parts (0.5).
• Like a friendship graph, but the edges denote “trust.”
  ◦ “I believe that X has identical values to me and will behave as I do.”
  ◦ “I believe that X is more expert than I and should make decisions.”
[Figure: the trust graph over marko, josh, pavel, and peter; Peter's two outgoing edges, to Marko and to Josh, each carry weight 0.5.]
Collective Decision Making: DDD
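• Marko believes Josh is the key to humanity.
• Josh prefers people closer to his eastern home of former China.
• Pavel is of the former Soviet Union, and simply has no faith in anyone.
[Figure: the trust graph with all edge weights: 0.5 and 0.5 on Peter's edges to Marko and Josh, 1.0 on Marko's edge to Josh, and 0.75 and 0.25 on Josh's two outgoing edges; Pavel has no outgoing edges.]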
Collective Decision Making: DDD
[Figure: the complete trust graph over marko, josh, pavel, and peter, with edge weights 0.5, 0.5, 1.0, 0.75, and 0.25.]
This is the trust-based social graph. Individuals can add/remove outgoing edges from their vertex as they please. When decisions are required, the current snapshot of the graph is used to compute the collective decision.
Collective Decision Making: DDD
• In a dynamically distributed democracy, everyone can voice their opinion.
• The weighted average of all voiced opinions is the final decision.
• For our society of 4, a pure direct democracy would yield (0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75.
• When everyone participates, it's a direct democracy.
[Figure: the complete trust graph over marko, josh, pavel, and peter, with edge weights 0.5, 0.5, 1.0, 0.75, and 0.25.]
Collective Decision Making: DDD
• Assume Marko and Josh go hunting, again. By abstaining, they diffuse their vote power over their outgoing edges.
• By participating, Peter and Pavel aggregate vote power through their incoming edges.
• This diffusion process continues until all power has aggregated at participating individuals.
[Figure: the trust graph with opinions (marko 0.8, josh 0.8, pavel 0.9, peter 0.5) and edge weights (0.5, 0.5, 1.0, 0.75, 0.25); every individual starts with 1.0 unit of vote power.]
Collective Decision Making: DDD
• Note that Marko fully trusts Josh's decision making abilities.
• However, given that Josh is not participating, Marko is implicitly stating that he trusts Josh’s decision in choosing decision makers.
• Thus, Josh serves to route Marko’s power.
[Figure: the trust graph after one diffusion step; the vote power values shown are 1.0, 1.75, and 1.25.]
Collective Decision Making: DDD
• In the end, Peter and Pavel have aggregated all the energy in the graph (albeit, to different degrees).
• Now a weighted direct democracy is used to calculate the collective decision.
• The collective vote is ((1.5 · 0.5) + (2.5 · 0.9))/4 = 0.75, an error of |0.75 − 0.75| = 0.0.
[Figure: the trust graph after diffusion has finished; Pavel holds 2.5 units of vote power and Peter holds 1.5.]
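The worked example can be replayed with a short Python sketch of the diffusion. Two things here are assumptions rather than statements from the slides: the direction of Josh's edges (0.75 to Pavel and 0.25 to Peter) is inferred from the final totals of 2.5 and 1.5, and the function name, the step limit, and the stopping threshold are illustrative.

```python
opinions = {"marko": 0.8, "peter": 0.5, "josh": 0.8, "pavel": 0.9}

# Outgoing trust edges; Josh's split between Pavel and Peter is inferred.
trust = {
    "peter": {"marko": 0.5, "josh": 0.5},
    "marko": {"josh": 1.0},
    "josh": {"pavel": 0.75, "peter": 0.25},
    "pavel": {},
}


def ddd(opinions, trust, participants, epsilon=1e-9, max_steps=1000):
    """Diffuse vote power along trust edges until it rests at participants."""
    active = set(participants)
    power = {p: 1.0 for p in opinions}      # one unit of vote power each
    collected = {p: 0.0 for p in opinions}  # power absorbed by participants
    for _ in range(max_steps):              # guard against cycles of abstainers
        if sum(power.values()) <= epsilon:
            break
        moved = {p: 0.0 for p in opinions}
        for person, amount in power.items():
            if person in active:
                collected[person] += amount  # participants act as sinks
            else:
                for trusted, weight in trust[person].items():
                    moved[trusted] += amount * weight
        power = moved
    total = sum(collected.values())
    decision = sum(collected[p] * opinions[p] for p in active) / total
    return decision, collected


decision, collected = ddd(opinions, trust, ["peter", "pavel"])
print(round(decision, 2), collected["peter"], collected["pavel"])
```

With only Peter and Pavel participating, the weighted result matches the full-participation average of 0.75, whereas plain direct democracy gave 0.7.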
Collective Decision Making: DDD
[Figure: error (0.00 to 0.20) against percentage of active citizens (100 down to 0) for dynamically distributed democracy and direct democracy.]
[Figure: proportion of correct decisions (0.50 to 0.95) against percentage of active citizens (100 down to 0) for dynamically distributed democracy and direct democracy.]

Fig. 5. The relationship between $k$ and $e^{vote}_k$ for direct democracy (gray line) and dynamically distributed democracy (black line). The plot provides the proportion of identical, correct decisions over a simulation that was run with 1000 artificially generated networks composed of 100 citizens each.

As previously stated, let $x \in [0, 1]^n$ denote the political tendency of each citizen in this population, where $x_i$ is the tendency of citizen $i$ and, for the purpose of simulation, is determined from a uniform distribution. Assume that every citizen in a population of $n$ citizens uses some social network-based system to create links to those individuals that they believe reflect their tendency the best. In practice, these links may point to a close friend, a relative, or some public figure whose political tendencies resonate with the individual. In other words, representatives are any citizens, not political candidates that serve in public office. Let $A \in [0, 1]^{n \times n}$ denote the link matrix representing the network, where the weight of an edge, for the purpose of simulation, is denoted

$$A_{i,j} = \begin{cases} 1 - |x_i - x_j| & \text{if link exists} \\ 0 & \text{otherwise.} \end{cases}$$

In words, if two linked citizens are identical in their political tendency, then the strength of the link is 1.0. If their tendencies are completely opposing, then their trust (and the strength of the link) is 0.0. Note that a preferential attachment network growth algorithm is used to generate a degree distribution that is reflective of typical social networks “in the wild” (i.e. scale-free properties). Moreover, an assortativity parameter is used to bias the connections in the network towards citizens with similar tendencies. The assumption here is that given a system of this nature, it is more likely for citizens to create links to similar-minded individuals than to those whose opinions are quite different. The resultant link matrix $A$ is then normalized to be row stochastic in order to generate a probability distribution over the weights of the outgoing edges of a citizen. Figure 6 presents an example of an $n = 100$ artificially generated trust-based social network, where red denotes a tendency of 0.0, purple a tendency of 0.5, and blue a tendency of 1.0.

Fig. 6. A visualization of a network of trust links between citizens. Each citizen’s color denotes their “political tendency”, where full red is 0, full blue is 1, and purple is 0.5. The layout algorithm chosen is the Fruchterman-Reingold layout.

Given this social network infrastructure, it is possible to better ensure that the collective tendency and vote is appropriately represented through a weighting of the active, participating population. Every citizen, active or not, is initially provided with $\frac{1}{n}$ “vote power” and this is represented in the vector $\pi \in \mathbb{R}^n_+$, such that the total amount of vote power in the population is 1. Let $y \in \mathbb{R}^n_+$ denote the total amount of vote power that has flowed to each citizen over the course of the algorithm. Finally, $a \in \{0, 1\}^n$ denotes whether citizen $i$ is participating ($a_i = 1$) in the current decision making process or not ($a_i = 0$). The values of $a$ are biased by an unfair coin that has probability $k$ of making the citizen an active participant and $1 - k$ of making the citizen inactive. The iterative algorithm is presented below, where $\circ$ denotes entry-wise multiplication and $\epsilon \approx 1$.

$$\begin{array}{l} y \leftarrow 0 \\ \textbf{while } \sum_{i=1}^{n} y_i < \epsilon \textbf{ do} \\ \quad y \leftarrow y + (\pi \circ a) \\ \quad \pi \leftarrow \pi \circ (1 - a) \\ \quad \pi \leftarrow A\pi \\ \textbf{end} \end{array}$$

In words, active citizens serve as vote power “sinks” in that once they receive vote power, from themselves or from a neighbor in the network, they do not pass it on. Inactive citizens serve as vote power “sources” in that they propagate their vote power over the network links to their neighbors iteratively until all (or $\epsilon$) vote power has reached active citizens. At this point, the tendency in the active population is defined as $\delta^{tend} = x \cdot y$. Figure 4 plots the error incurred using dynamically distributed democracy (black line), where the error is defined as

$$e^{tend}_k = |d^{tend}_{100} - \delta^{tend}_k|.$$

Next, the collective vote $\delta^{vote}_k$ is determined by a weighted majority as dictated by the vote power accumulated by active participants. Figure 5 plots the proportion of votes that are different from what a fully participating population would
• As participation wanes, dynamically distributed democracy is able to simulate direct democracy. [14]
[14] Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making Systems,” Proceedings of the Computational Social and Organizational Science Conference, 2004. [http://arxiv.org/abs/cs/0412047]
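The link-matrix construction described in the excerpt (a trust weight of one minus the absolute difference in tendency for each existing link, followed by row normalization) can be sketched in plain Python. The function name and the three-citizen example are hypothetical, and the preferential attachment and assortativity steps of the simulation are not modeled here.

```python
def trust_matrix(x, links):
    """Build a row-stochastic link matrix from tendencies x and directed links.

    x     -- list of political tendencies in [0, 1], one per citizen
    links -- iterable of (i, j) pairs meaning citizen i links to citizen j
    """
    n = len(x)
    A = [[0.0] * n for _ in range(n)]
    for i, j in links:
        A[i][j] = 1.0 - abs(x[i] - x[j])   # identical tendencies give weight 1.0
    for row in A:
        total = sum(row)
        if total > 0:                       # citizens without links keep a zero row
            row[:] = [w / total for w in row]
    return A


# Illustrative data: citizen 0 links to citizens 1 and 2.
A = trust_matrix([0.2, 0.4, 0.8], [(0, 1), (0, 2)])
print([round(w, 3) for w in A[0]])
```

Citizen 0 is closer in tendency to citizen 1 than to citizen 2, so after normalization two thirds of its outgoing weight goes to citizen 1.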
Collective Decision Making: Techno-Government
• In this model of decision making, there is no governmental body.
• Power is determined when a decision is needed.
• How are bills created? Wikilegislature? [15]
• What about different types of trust (e.g. “Marko trusts Josh in engineering decisions only.”)? Hint: Multi-relational+ graphs. Tagging legislature and tagging trust. [16]
[15] Turoff, M., Roxanne-Hiltz, S., Bieber, M., Rana, A., “Collaborative Discourse Structures in Computer Mediated Group Communications”, Hawaii International Conference on Systems Science (HICSS), 1998. [http://web.njit.edu/~turoff/Papers/CDSCMC/CDSCMC.htm]
[16] Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007. [http://arxiv.org/abs/cs/0609034]
“The founders of modern democracies provided a moral heritage that remains highly regarded in societies today. However, it should be remembered that it is the ideals that are valuable, not the specific implementation of the systems that protect and support them. If there is another implementation of government that better realizes these ideals, then, by the rights of man, it must be enacted.” [17]
– Michael Scott
[17] Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), University of Illinois at Chicago Library, 2009. [http://arxiv.org/abs/0901.3929]
Eudaemonic Engine: Seeking Virtue through Circuitry
The year is 2018.
Human life on earth has stabilized.
Humans no longer struggle to survive. They struggle for eudaemonia. They seek the “good daemon” within...
Eudaemonic Engine: Aristotle
• Being virtuous is repeatedly choosing correctly.
• Habitual correct behavior leads to eudaemonia – complete engagement in the world
(a complete sense of engagement/acceptance). [18] [19]
• Can systems aid individuals in choosing correctly – in all aspects of life?
[Images: David L. Norton and Aristotle.]
[18] Aristotle, “Nicomachean Ethics”, 350 B.C.
[19] Mihaly Csikszentmihalyi, “Flow: The Psychology of Optimal Experience”, Harper Perennial, 1990.
Eudaemonic Engine: Resource Modeling
But if the development of character is the moral objective, it is obvious that
[...] the choices of vocation and avocations to pursue, of friends to cultivate, of
books to read are moral for they clearly influence such development. [20]
• Web services are continuing to build richer models of humans, resources, and the relationships between them.
• There exists an increasing reliance on such services to aid in decision making: correct books (Amazon.com), correct movies (NetFlix.com), correct music (Pandora), correct occupation (Monster.com), correct friends (PointsCommuns.com), correct life partner (Match.com), etc. [21]
[20] David L. Norton, “Democracy and Moral Development: A Politics of Virtue”, University of California Press, 1991.
[21] Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, 5712, pp. 813–820, 2009. [http://arxiv.org/abs/0904.0027]
Eudaemonic Engine: Mapping Person to Resource
[Figure: a person mapped, over time, to actions on resources: watch a movie, read an article, listen to music, meet a friend, eat food.]
Map an individual to actions on resources. However, how do we model/expose the resources of the world?
Model
Eudaemonic Engine: The Web of Data
[Figure: cloud diagram of the interlinked data sets on the Web of Data, including dbpedia, freebase, geonames, musicbrainz, uniprot, pubmed, drugbank, yago, opencyc, and many others.]
Eudaemonic Engine: URIs of the Web of Data
[Figure: an RDF graph that spans two data sets, DBPEDIA and FLICKR. Resources shown: dbpedia:Fountain_Head (http://dbpedia.org/resource/The Fountainhead), dbpedia:Ayn_Rand, dbpedia:Book, and flickr:Ayn_Rand (http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Ayn_Rand). Edge labels shown: dbpedia:author, rdf:type, dbpprop:hasPhotoCollection, and foaf:depiction.]
Eudaemonic Engine: Datasets on the Web of Data
data set | domain | data set | domain | data set | domain
------------------------------------------------------------------------------------
audioscrobbler | music | govtrack | government | pubguide | books
bbclatertotp | music | homologene | biology | qdos | social
bbcplaycountdata | music | ibm | computer | rae2001 | computer
bbcprogrammes | media | ieee | computer | rdfbookmashup | books
budapestbme | computer | interpro | biology | rdfohloh | social
chebi | biology | jamendo | music | resex | computer
crunchbase | business | laascnrs | computer | riese | government
dailymed | medical | libris | books | semanticweborg | computer
dblpberlin | computer | lingvoj | reference | semwebcentral | social
dblphannover | computer | linkedct | medical | siocsites | social
dblprkbexplorer | computer | linkedmdb | movie | surgeradio | music
dbpedia | general | magnatune | music | swconferencecorpus | computer
doapspace | social | musicbrainz | music | taxonomy | reference
drugbank | medical | myspacewrapper | social | umbel | general
eurecom | computer | opencalais | reference | uniref | biology
eurostat | government | opencyc | general | unists | biology
flickrexporter | images | openguides | reference | uscensusdata | government
flickrwrappr | images | pdb | biology | virtuososponger | reference
foafprofiles | social | pfam | biology | w3cwordnet | reference
freebase | general | pisa | computer | wikicompany | business
geneid | biology | prodom | biology | worldfactbook | government
geneontology | biology | projectgutenberg | books | yago | general
geonames | geographic | prosite | biology | . . . |
Eudaemonic Engine: Transforms Development
A new application development paradigm emerges. No longer do data and application
providers need to be the same entity (left). With the Web of Data, it's possible for
developers to write applications that utilize data that they do not maintain (right). [22]
[Figure, left: Application 1, Application 2, and Application 3 on hosts 127.0.0.1, 127.0.0.2, and 127.0.0.3, each with its own structures and processes. Figure, right: the same three applications and hosts, with their structures joined into a shared Web of Data and their processes kept separate.]
[22] Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for Information Science and Technology, 35(6), pp. 38–43, 2009. [http://arxiv.org/abs/0908.0373]
Now that there is a rich structure, what is the process?
Process
Eudaemonic Engine: Diffusion Processes on Graphs
A graph diffusion process will be used to determine the solution to one’s problems.
• Graph traversing can be seen as a diffusion process over a graph.
• “Energy” moves over a graph and reverberates in regions where there is recurrence (i.e. cycles).
• At some t in the future, the vertices with the greatest flow are the solution to the problem.
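Implementing a diffusion process is easy when the edges of the graph are unlabeled.
// Groovy sketch; startVertex and Vertex.getAdjacentVertices() are assumed to exist
flow = [:].withDefault { 0 }   // how often energy has arrived at each vertex
current = [startVertex]
steps = 10
for (int i = 0; i < steps; i++) {
  // collectMany flattens the per-vertex neighbor lists into one list
  current = current.collectMany { it.getAdjacentVertices() }
  current.each { flow[it] = flow[it] + 1 }
}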
Eudaemonic Engine: Diffusion on a Property Graph?
[Figure: a toy property graph. Vertex names shown: marko, jen, peter, emil, gremlin, linkedprocess, The Wire, True Blood, 24, intelligence, and graphs. Edge labels shown: knows, likes, wrote, occupation, and tagged.]
With different types of things being related by different types of relations, you need to specify legal paths for the energy to flow over.
Eudaemonic Engine: Diffusion on a Property Graph
• Problem statement = Start vertices + path expression.
• Problem solution = Highest energy vertices at t. [23] [24] [25]
[23] Examples presented next are basic due to the simplicity of the toy graph example used. In such cases, queries as opposed to energy diffusions are best. In general, the purpose of an energy diffusion is to expose recurrence/feedback in the graph. For the more technically inclined, think of it as determining the eigenvector of the graph defined by the path expression.
[24] Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems, 21(7), pp. 727–739, 2008. [http://arxiv.org/abs/0803.4355]
[25] Rodriguez, M.A., Neubauer, P., “A Path Algebra for Multi-Relational Graphs,” 2nd International Workshop on Graph Data Management (GDM11), 2010. [http://arxiv.org/abs/1011.0390]
Eudaemonic Engine: Friend Recommendation
[Figure: the toy property graph (marko, jen, peter, emil, gremlin, linkedprocess, The Wire, True Blood, and so on) with the knows paths leading out of marko highlighted.]
“Who are my friends’ friends that are not me or my friends?” [26]
[26] marko.outE[[label:'knows']].inV.aggregate(x).outE.inV{!x.contains(it)}
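For readers who do not know Gremlin, the friend-of-a-friend question can be approximated in Python. The knows edges below are made up for illustration (the figure's actual edges are not recoverable from the text), the function name is hypothetical, and unlike the traversal above this sketch follows only knows edges on the second hop and also excludes the starting person.

```python
# Hypothetical "knows" adjacency lists.
knows = {
    "marko": ["peter", "jen"],
    "peter": ["emil"],
    "jen": ["peter"],
    "emil": [],
}


def recommend_friends(person, knows):
    """Friends of friends who are neither the person nor already a friend."""
    friends = set(knows.get(person, []))   # plays the role of aggregate(x)
    counts = {}
    for friend in friends:
        for candidate in knows.get(friend, []):
            if candidate != person and candidate not in friends:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Rank by the number of paths that reach each candidate.
    return sorted(counts, key=lambda name: (-counts[name], name))


print(recommend_friends("marko", knows))
```

Here peter is reachable through jen but is filtered out because marko already knows him, which is what the `!x.contains(it)` filter does in the Gremlin expression.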
Eudaemonic Engine: Product Recommendation
[Figure: the toy property graph with the likes paths leading out of marko highlighted.]
“Who likes what I like? Of those things they like, what else do they like that I don’t already like?” [27]
[27] marko.outE[[label:'likes']].inV.aggregate(x).inE[[label:'likes']].outV.outE[[label:'likes']].inV{!x.contains(it)}
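The same likes-based path can be sketched in Python as a path count, which is a basic form of collaborative filtering. The likes data is invented for illustration and the function name is hypothetical; each recommended item is scored by the number of likes, liked-by, likes paths that reach it from the starting person.

```python
from collections import Counter

# Hypothetical "likes" sets per person.
likes = {
    "marko": {"The Wire", "gremlin"},
    "peter": {"The Wire", "True Blood"},
    "jen": {"gremlin", "True Blood", "intelligence"},
    "emil": {"linkedprocess"},
}


def recommend_products(person, likes):
    """Score unseen items by the number of shared-like paths leading to them."""
    mine = likes[person]
    scores = Counter()
    for other, theirs in likes.items():
        if other == person:
            continue
        shared = len(mine & theirs)   # paths from person to this other person
        if shared == 0:
            continue
        for item in theirs - mine:    # skip what the person already likes
            scores[item] += shared
    return scores.most_common()


print(recommend_products("marko", likes))
```

An item reached over more paths accumulates more "energy", so True Blood ranks above intelligence in this made-up data set.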
Eudaemonic Engine: Product Recommendation 2
[Figure: the toy property graph with both the likes paths and the knows paths leading out of marko highlighted.]
“Who likes what I like and what do they like? What do the people I know like? Of those things liked, what do I not already like?”
Eudaemonic Engine: Recommendation
• Different paths through a domain model expose different types of recommendations.
• Individual path preferences allow for an ecosystem of traversals (different problems can be solved over the same domain model). [28] [29] [30]
[28] Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” 2009. [http://arxiv.org/abs/0905.1594]
[29] Rodriguez, M.A., “Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and Recommendation,” Technical Talk Seminar, AT&T Interactive, 2010. [http://slidesha.re/bOCy4Q]
[30] Traversal Patterns with Gremlin available at https://github.com/tinkerpop/gremlin/wiki/Traversal-Patterns.
Universal Computer: A Single Computational Substrate
The year is 2023.
Life is good. Humans flourish. Virtuous men’s minds are filled with wonderfully creative ideas. Inventions proliferate.
Advances in computer network technology yield a new model of computing.
Computer networks are no longer the bottleneck for speed. Accessing local and remote data is no longer considered “different.” The distinction between RAM, disk drive, and Web disappears.
Universal Computer: A Computational Substrate
On the Web...
• Represent data.
• Represent code.
• Represent virtual machines.
Universal Computer: Represent Data
• URIs form an infinite universal address space.
• A URI can denote a datum.
  * http://markorodriguez.com#self (Marko)
  * http://sws.geonames.org/4887398/about.rdf (Chicago)
  * http://data.nytimes.com/N38395718310308503251 (Malmo)
• RDF (Resource Description Framework) is a data model for linking URIs into a multi-relational graph.
Universal Computer: Represent Data
[Figure: RDF graph spanning two servers. atti:marko (127.0.0.1) has atti:numberOfLegs "2"^^xsd:integer and atti:hasFur "false"^^xsd:boolean and is linked by atti:bestFriend to nm:puppy (127.0.0.2), which has atti:numberOfLegs "4"^^xsd:integer and atti:hasFur "true"^^xsd:boolean.]
• The concept of atti:marko and the properties atti:numberOfLegs, atti:hasFur, and atti:bestFriend are maintained by the AT&Ti graph server.
• The concept of nm:puppy is maintained by a New Mexico graph server.
• The data types xsd:integer and xsd:boolean are maintained by the XML standards organization.
Universal Computer: Represent Code
• Computing is a series of instructions — add, write, branch, goto...
• The URI address space and RDF glue can be seen as a computational medium.31
[Figure: blank node _:123 has rdf:type atti:Add, atti:left-op "3"^^xsd:int, and atti:right-op "7"^^xsd:int; atti:Add is an rdfs:subClassOf atti:Instruction.]
31 Rodriguez, M.A., “General-Purpose Computing on a Semantic Network Substrate,” Emergent Web Intelligence: Advanced Semantic Technologies, eds. R. Chbeir, A. Hassanien, A. Abraham, and Y. Badr, pp. 57–104, 2010. [http://arxiv.org/abs/0704.3395]
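To make the idea of an instruction stored as graph data concrete, here is a toy Java sketch that keeps the triples describing _:123 in a map and evaluates them. It is not the RDF virtual machine from the cited work: the class, its methods, and the use of plain strings for literals are assumptions made for the illustration, and only atti:Add is handled.

```java
import java.util.*;

final class TripleInstruction {
    // subject -> (predicate -> object): a toy in-memory stand-in for an RDF store.
    private final Map<String, Map<String, String>> triples = new HashMap<>();

    void add(String subject, String predicate, String object) {
        triples.computeIfAbsent(subject, s -> new HashMap<>()).put(predicate, object);
    }

    /** Evaluates the instruction described at the given node; only atti:Add is supported here. */
    int evaluate(String node) {
        Map<String, String> description =
                triples.getOrDefault(node, Collections.<String, String>emptyMap());
        if (!"atti:Add".equals(description.get("rdf:type"))) {
            throw new IllegalArgumentException("unsupported instruction at " + node);
        }
        // The operands are read from the graph, just like the instruction type.
        int left = Integer.parseInt(description.get("atti:left-op"));
        int right = Integer.parseInt(description.get("atti:right-op"));
        return left + right;
    }

    public static void main(String[] args) {
        TripleInstruction store = new TripleInstruction();
        store.add("_:123", "rdf:type", "atti:Add");
        store.add("_:123", "atti:left-op", "3");
        store.add("_:123", "atti:right-op", "7");
        System.out.println(store.evaluate("_:123")); // prints 10
    }
}
```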
Universal Computer: Represent Code
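[Figure: RDF graph in which a method (atti:pet, annotated "// make animal happy") is attached via atti:hasMethod and described by the blank nodes _:1234, _:2345, and _:3456 using atti:args ("animal"^^xsd:string), atti:block, and atti:inst. Also shown: atti:marko, atti:bestFriend, nm:puppy, and atti:isHappy "false"^^xsd:boolean.]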
Represent methods and their instructions attached to objects/classes.
Universal Computer: Represent Virtual Machines
[Figure: the method graph (atti:marko, atti:bestFriend, nm:puppy, atti:pet, atti:hasMethod, atti:block, atti:inst, _:2345, _:3456, atti:isHappy "false"^^xsd:boolean) joined by a virtual machine: blank node _:6789 has rdf:type atti:VM and an atti:pc property. Annotation: write "true"^^xsd:boolean.]
Represent not only code, but the machines that execute it.
Universal Computer: Represent Virtual Machines
[Figure: class diagram of the RVM ontology for Fhat. Classes: Fhat, Instruction, Frame, FrameVariable, Block, ReturnStack, OperandStack, BlockStack, and rdfs:Resource. Properties: halt, programLocation, hasFrame, currentFrame, returnTop, operandTop, blockTop, hasSymbol, hasValue, forFrame, fromBlock, methodReuse, rdf:first, rdf:rest, and rdf:li.]
Neno/Fhat Project (circa 2006): http://neno.lanl.gov.
[Figure: layered view of the universal computer. Physical machines (127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, ...) read and write a global data structure that holds data, API, program, machine architecture, virtual machine state, and virtual machine processes. Side labels: "Physics" and "My Belief in Reality".]
Universal Computer: A Ramification
• Data, APIs, code, machine architectures, and virtual machines are within the same global URI address space.
  * Code can be physically distributed across computers. For example, an add instruction on 127.0.0.1 references a branch instruction on 127.0.0.2.
  * Hardware machines can be added or removed without altering the state of computation, only the speed.
  * No developer concept of RAM-based memory addresses: the only address space is the space of all URIs.
Universal Computer: Another Ramification
• Reflection down to the machine level.32
  * Most languages support the manipulation of code at runtime. In this model, the virtual machine can be altered at runtime.
  * Code can rewrite the virtual machine that is evaluating the code (i.e. create lots of bugs).
32 Rodriguez, M.A., The RDF Virtual Machine, LA-UR-08-03925, in review, 2009. [http://arxiv.org/abs/0802.3492]
The year is 2030.
Man learns to encode themselves into the URI address space...33 34
33 Egan, G., “Permutation City,” Eos Publisher, 1995.
34 Rodriguez, M.A., “From the Signal to the Symbol: Structure and Process in Artificial Intelligence,” Center for Nonlinear Studies Post Doctorate Seminar, Los Alamos National Laboratory, Los Alamos, New Mexico, 2008. [http://slidesha.re/hdqRn2]
Outline
• Graph Structures
• Graph Databases
• Graph Applications
• TinkerPop Product Suite
This is the TinkerPop...
TinkerPop Productions
• Blueprints: Data Models and their Implementations
[http://blueprints.tinkerpop.com]
• Pipes: A Data Flow Framework using Process Graphs
[http://pipes.tinkerpop.com]
• Gremlin: A Graph-Based Programming Language
[http://gremlin.tinkerpop.com]
• Rexster: A RESTful Graph Shell
[http://rexster.tinkerpop.com]35
35 Please see http://engineering.attinteractive.com/2010/12/a-graph-processing-stack/ for a short review of these products. Also see TinkerPop’s homepage at http://tinkerpop.com
Blueprints: A Property Graph Model Interface
Blueprints
• Blueprints is like the JDBC of the graph database community.
• Provides a Java-based interface API for the property graph data model.
  * Graph, Vertex, Edge, Index.
• Connectors to TinkerGraph, Neo4j, OrientDB, Sails (e.g. AllegroGraph, HyperSail, etc.), and soon InfiniteGraph. Into the future, hope to support InfoGrid, Sones, DEX, and HyperGraphDB.36
36 HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current form, only supports the more common binary graph.
Creating a Neo4jGraph in Blueprints
// create a graph
Graph graph = new Neo4jGraph("/tmp/neo4j");
// add two vertices
Vertex a = graph.addVertex(null);
a.setProperty("name","marko");
Vertex b = graph.addVertex(null);
b.setProperty("name","peter");
// join the two vertices by a knows relation
Edge e = graph.addEdge(null,a,b,"knows");
e.setProperty("since","2007");
[Figure: vertex 0 (name=marko) connected to vertex 1 (name=peter) by a knows edge (since=2007).]
Handy Features of Blueprints
• Supports automatic transactions
  * graph.setTransactionMode(AUTOMATIC -or- MANUAL)
  * In automatic mode, every manipulation of the graph is wrapped in a transaction and committed.
• Supports automatic indices
  * graph.createIndex(AUTOMATIC -or- MANUAL)
  * In automatic mode, elements are added or removed from an index as their properties are manipulated.
• Utility Suite
  * Blueprints Sail makes a graphdb into a traversal-based RDF store.
  * GraphML Reader/Writer library.
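What "automatic" means for an index can be illustrated without any graph database at all. The following is a minimal sketch, assuming a hypothetical in-memory store (the class and method names are invented and are not the Blueprints API): every property write also updates the index, so lookups never see stale entries.

```java
import java.util.*;

final class AutoIndexSketch {
    // element id -> its properties
    private final Map<Integer, Map<String, Object>> properties = new HashMap<>();
    // property key -> property value -> ids of the elements holding that value
    private final Map<String, Map<Object, Set<Integer>>> index = new HashMap<>();

    /** Sets a property and updates the index in the same call, as an automatic index would. */
    void setProperty(int id, String key, Object value) {
        Map<String, Object> props = properties.computeIfAbsent(id, k -> new HashMap<>());
        Object old = props.put(key, value);
        Map<Object, Set<Integer>> byValue = index.computeIfAbsent(key, k -> new HashMap<>());
        if (old != null && byValue.containsKey(old)) {
            byValue.get(old).remove(id); // drop the stale index entry
        }
        byValue.computeIfAbsent(value, v -> new TreeSet<>()).add(id);
    }

    /** Returns the ids of all elements whose property 'key' currently equals 'value'. */
    Set<Integer> lookup(String key, Object value) {
        Map<Object, Set<Integer>> byValue =
                index.getOrDefault(key, Collections.<Object, Set<Integer>>emptyMap());
        return byValue.getOrDefault(value, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        AutoIndexSketch graph = new AutoIndexSketch();
        graph.setProperty(0, "name", "marko");
        graph.setProperty(1, "name", "peter");
        graph.setProperty(1, "name", "pavel"); // overwriting re-indexes element 1
        System.out.println(graph.lookup("name", "peter")); // prints []
        System.out.println(graph.lookup("name", "pavel")); // prints [1]
    }
}
```

With a manual index, the caller would be responsible for the two index updates that setProperty performs here.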
Pipes: A Data Flow Framework using Process Graphs
Pipes
• Lazy data flow with support for Blueprints-based graph processing.
• Provides a collection of “pipes” (implement Iterable and Iterator) that are connected together to form processing pipelines.
  * Filters: ComparisonFilterPipe, RandomFilterPipe, etc.
  * Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc.
  * Splitting/Merging: CopySplitPipe, RobinMergePipe, etc.
  * Logic: OrFilterPipe, AndFilterPipe, etc.
Pipes: Chained Iterators
This pipeline takes objects of type A and turns them into objects of type D
through a sequence of processing pipes...37
[Figure: a Pipeline in which objects of type A enter Pipe1, which emits B; Pipe2 turns B into C; Pipe3 turns C into D.]
Pipe<A,D> pipeline =
new Pipeline<A,D>(Pipe1<A,B>, Pipe2<B,C>, Pipe3<C,D>)
37 Though not discussed, splitting and merging are allowed as well (branching pipelines).
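The chained-iterator idea can be shown with nothing but java.util.Iterator. The sketch below is an assumption-laden stand-in rather than the Pipes library: `LazyChain` and its `map` helper are invented names, and each stage simply wraps the previous iterator so that an element is transformed only when the end of the chain asks for it.

```java
import java.util.*;
import java.util.function.Function;

final class LazyChain {
    /** Wraps an iterator so that each element is transformed only when next() is called. */
    static <S, E> Iterator<E> map(final Iterator<S> starts, final Function<S, E> step) {
        return new Iterator<E>() {
            @Override
            public boolean hasNext() {
                return starts.hasNext();
            }

            @Override
            public E next() {
                return step.apply(starts.next());
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> a = Arrays.asList("graph", "pipe").iterator();
        Iterator<Integer> b = map(a, String::length);      // A -> B
        Iterator<Integer> c = map(b, n -> n * n);          // B -> C
        Iterator<String> d = map(c, n -> "len^2=" + n);    // C -> D
        while (d.hasNext()) {
            System.out.println(d.next()); // prints len^2=25, then len^2=16
        }
    }
}
```

Nothing is computed when the stages are connected; work happens one element at a time as the final iterator is drained.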
Pipes: A Simple Example
“What are the names of the people that marko knows?”
[Figure: graph with vertices A (name=marko), B (name=peter), C (name=pavel), and D (name=gremlin), two knows edges, and two created edges.]
Pipes: A Simple Example
Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
// drop the edges whose label is not "knows"
Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
Pipe<Vertex,String> pipeline = new Pipeline<Vertex,String>(pipe1,pipe2,pipe3,pipe4);
pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));
Pipes: A Simple Example
for(String name : pipeline) {
System.out.println(name);
}
peter
pavel
Pipes: A Simple Example
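[Figure: the graph of vertices A (name=marko), B (name=peter), C (name=pavel), and D (name=gremlin), annotated with the pipeline stages VertexEdgePipe(OUT_EDGES), LabelFilterPipe("knows"), EdgeVertexPipe(IN_VERTEX), and PropertyPipe("name"), applied in that order starting from vertex A.]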
Pipes: Library of Generally Useful Pipes
[ FILTERS ]
AndFilterPipe
CollectionFilterPipe
ComparisonFilterPipe
DuplicateFilterPipe
FutureFilterPipe
ObjectFilterPipe
OrFilterPipe
RandomFilterPipe
RangeFilterPipe
[ SPLITS ]
CopySplitPipe
RobinSplitPipe
[ MERGES ]
ExhaustiveMergePipe
RobinMergePipe
[ GRAPHS ]
EdgeVertexPipe
IdFilterPipe
IdPipe
LabelFilterPipe
LabelPipe
PropertyFilterPipe
PropertyPipe
VertexEdgePipe
[ SIDEEFFECTS ]
AggregatorPipe
CountCombinePipe
CountPipe
KeyCombinePipe
SideEffectCapPipe
[ UTILITIES ]
DynamicStartsPipe
GatherPipe
PathPipe
PrintStreamPipe
ProductPipe
ScatterPipe
TypeCastPipe
Pipeline
...
Pipes: Easy to Create New Pipes
public class NumCharsPipe extends AbstractPipe<String,Integer> {
  public Integer processNextStart() {
    String word = this.starts.next();
    return word.length();
  }
}
When extending the base class AbstractPipe<S,E>, all that is required is an implementation of processNextStart().
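Why a single method is enough can be seen from a base class that derives hasNext() and next() from it. The following is a hypothetical stand-in written for this note, not the com.tinkerpop.pipes source: `SketchPipe`, `LongWordPipe`, and `PipeSketchDemo` are invented names, and the assumption is that a subclass signals exhaustion by letting NoSuchElementException propagate from the upstream iterator.

```java
import java.util.*;

final class PipeSketchDemo {
    public static void main(String[] args) {
        LongWordPipe pipe = new LongWordPipe();
        pipe.setStarts(Arrays.asList("a", "graph", "db", "pipes").iterator());
        while (pipe.hasNext()) {
            System.out.println(pipe.next()); // prints graph, then pipes
        }
    }
}

// Hypothetical base class: subclasses implement only processNextStart().
abstract class SketchPipe<S, E> implements Iterator<E> {
    protected Iterator<S> starts;
    private E nextEnd;
    private boolean available = false;

    void setStarts(Iterator<S> starts) {
        this.starts = starts;
    }

    /** Returns the next result, or lets NoSuchElementException escape when the input is exhausted. */
    protected abstract E processNextStart();

    @Override
    public boolean hasNext() {
        if (available) {
            return true;
        }
        try {
            nextEnd = processNextStart(); // look ahead by one result
            available = true;
        } catch (NoSuchElementException e) {
            available = false;
        }
        return available;
    }

    @Override
    public E next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        available = false;
        return nextEnd;
    }
}

// A filter written against the sketch: emits only words longer than three characters.
final class LongWordPipe extends SketchPipe<String, String> {
    @Override
    protected String processNextStart() {
        while (true) {
            String word = starts.next(); // throws NoSuchElementException at the end of input
            if (word.length() > 3) {
                return word;
            }
        }
    }
}
```

The look-ahead in hasNext() is what lets a filter skip any number of inputs while still presenting an ordinary Iterator to the next stage.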
Pipes: Easy to Create New Pipes
[Figure: layers labeled com.tinkerpop.pipes, domain specific, and complex traversal algorithms.]
Most of my projects are composed of lots of application-specific Pipes. That is, Pipes that are specific to my domain model and yield useful jumps in the graph. For example, SameLikesPipe<Vertex,Vertex>.
From these domain-specific Pipes, complex algorithms are created by piecing those Pipes together. For example, RecommenderPipe<Vertex,Map>.
Gremlin: A Graph-Based Programming Language
GremlinG = (V,E)
• A graph traversal language that uses Groovy as its host language.
• Compiles Gremlin syntax down to Pipes (implements JSR 223).38
38 At the time of this presentation, Gremlin’s most recent stable release is 0.6, which is a standalone language. To increase the flexibility of the language, 0.7-SNAPSHOT+ uses Groovy as the host language.
Gremlin: Easily Compose Graph Related Pipes
Pipes is verbose...
Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
Pipe<Vertex,String> pipeline = new Pipeline(pipe1,pipe2,pipe3,pipe4);
pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));
...relative to Gremlin.
g.v('A').outE[[label:'knows']].inV.name
Gremlin: The Simple Example
[Figure: the graph of vertices A (name=marko), B (name=peter), C (name=pavel), and D (name=gremlin), annotated with the Gremlin steps g.v('A'), outE, [[label:'knows']], inV, and name.]
Gremlin: Defining a Step
“Who likes the same things that I like?”
Vertex.metaClass.same_likes =
{ _().outE[[label:'likes']].inV.inE[[label:'likes']].outV }
[Figure: graph with vertices A, B, C, D, E, F, and G connected by seven likes edges.]
Gremlin: Defining a Step
gremlin> g.v('A').same_likes
==>v[E]
==>v[F]
==>v[F]
==>v[G]
Gremlin: Defining a Step
gremlin> m = g.v('A').same_likes.group_count >> 1
gremlin> m
==>v[E]=1
==>v[F]=2
==>v[G]=1
v[F] is most similar, in terms of likes, to v[A].39
39 For a thorough review of such traversal patterns, please see: Rodriguez, M.A., “Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and Recommendation,” July 2010. [http://slidesha.re/bOCy4Q]
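The counting step can be reproduced in plain Java to show where the numbers come from. This is a sketch under assumptions: the figure shows seven likes edges but the extracted text does not say which vertex likes which, so the edge assignment below (A likes B, C, D; E likes B; F likes B and C; G likes D) is invented to be consistent with the counts above, and the class name is hypothetical. The sketch also excludes the starting vertex from the result.

```java
import java.util.*;

final class SameLikesCount {
    /** Counts, for every other vertex, how many liked things it shares with 'me'. */
    static Map<String, Integer> sameLikes(Map<String, List<String>> likes, String me) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String item : likes.getOrDefault(me, Collections.<String>emptyList())) {
            for (Map.Entry<String, List<String>> entry : likes.entrySet()) {
                // Each shared item is one path: me -likes-> item <-likes- other.
                if (!entry.getKey().equals(me) && entry.getValue().contains(item)) {
                    counts.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed edge assignment; only the totals match the slide.
        Map<String, List<String>> likes = new HashMap<>();
        likes.put("A", Arrays.asList("B", "C", "D"));
        likes.put("E", Arrays.asList("B"));
        likes.put("F", Arrays.asList("B", "C"));
        likes.put("G", Arrays.asList("D"));
        System.out.println(sameLikes(likes, "A")); // prints {E=1, F=2, G=1}
    }
}
```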
Rexster: A RESTful Graph Shell
reXster
• Allows Blueprints graphs to be exposed through a RESTful API (HTTP).
• All communication is via JSON.
• Supports stored traversals written in raw Pipes or Gremlin.
• Supports ad hoc traversals represented in Gremlin.
• Provides “helper classes” for performing search-, score-, and rank-based traversal algorithms, which together support recommendation.
Rexster: URI Patterns
• http://localhost/graph/vertices: all the vertices in the graph.
• http://localhost/graph/vertices/1: vertex with id 1 in the graph.
• http://localhost/graph/vertices/1/outE: outgoing edges of vertex with id 1.
{ "results": {
    "_type":"vertex",
    "_id":"1",
    "name":"aaron",
    "type":"person"
  },
  "query_time":0.1537 }
Typical TinkerPop Graph Stack
[Figure: stack diagram showing a REST request (GET http://{host}/{resource}) served over the graph back ends NativeStore, TinkerGraph, and Neo4j.]
Conclusion
• Property graphs are convenient structures for modeling the real world.
• Graph databases provide index-free adjacency to ensure speedy traversal over graphs.
• The graph is such a general data structure that it can be used for numerous applications.
• TinkerPop provides a database-agnostic stack of technologies for working with property graphs.
Acknowledgements
• Research collaborators: Daniel Steinbock (Stanford), Jennifer H. Watkins (LANL), Alberto Pepe (Harvard), Joshua Shinavier (RPI), Johan Bollen (LANL), Herbert Van de Sompel (LANL).
• TinkerPop contributors: Pavel Yaskevich (Riptano), Stephen Mallette (Independent), Darrick Wiebe (Independent), Alex Averbuch (Swedish Institute of CS), Peter Neubauer (Neo4j).
• Others: Emil Eifrem (Neo4j), Luca Garulli (Orient Technologies), Aaron Patterson (AT&Ti).