Managing Large RDF Graphs
(Infinite Graph)
Vaibhav Khadilkar
Department of Computer Science,
The University of Texas at Dallas
FEARLESS engineering
Agenda
Motivation behind the project
Semantic web technologies overview
Proposed architecture
Performance metrics
Motivation - Current Problems
Jena’s in-memory model does not scale
Jena’s RDB and SDB models cannot handle large result sets
These limits hinder reasoning and large graph processing
Current work focuses on load balancing and fault tolerance
Current systems can be broken with as few as 100,000 triples
We work on load balancing and polynomial reasoning, but memory management breaks systems before any other problem can be addressed
Motivation - Relevance of the problem
This is an unsolved problem
Solving it is critical for handling the terabytes of data that are common today
Approach: move the problem from memory space to disk space
[Diagram labels: Jena; In-memory, RDB, SDB, ARQ; Extension; Reasoning]
Semantic web technologies overview - Jena
Jena is a Java-based framework for building Semantic Web applications
Jena provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine
Jena allows the creation and manipulation of in-memory or relational-database-backed (RDB and SDB) RDF graphs
Semantic web technologies overview - Lucene
Lucene is a Java-based text indexing and searching tool
The smallest unit of text that Lucene indexes and searches is a Document
A Document contains different fields and a corresponding value for each field
The different fields are the indexes that can be used as keywords during a search
Problems with In-memory Jena Model
Able to handle medium-sized graphs
As nodes are added, memory fills up
As more nodes are added, the program crashes with an out-of-memory exception
We want to solve this out-of-memory problem
Buffer Management Strategy
[Diagram components: In-memory triple store + buffer; Lucene triple store]
1. Add triples
2. Added triples = Threshold
3. Buffer sorted based on memory management algorithm
4. Write triples based on sorted buffer while triples left > x of Threshold
5. Continue adding triples
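The five steps of the buffer management strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the class name `SpillingTripleStore`, the `keep_fraction` parameter (standing in for the "x" on the slide), the choice of LRU as the memory management algorithm, and the plain `set` that stands in for the Lucene triple store are all hypothetical.

```python
from collections import OrderedDict


class SpillingTripleStore:
    """Toy in-memory triple store that spills to a disk stand-in at a threshold."""

    def __init__(self, threshold, keep_fraction):
        self.threshold = threshold          # number of in-memory triples that triggers a spill
        self.keep_fraction = keep_fraction  # assumed meaning of "x of Threshold"
        self.memory = OrderedDict()         # triple -> None, least recently used first
        self.disk = set()                   # hypothetical stand-in for the Lucene triple store

    def add(self, triple):
        # Step 1: add the triple; re-adding counts as a fresh use.
        self.memory[triple] = None
        self.memory.move_to_end(triple)
        # Step 2: the number of in-memory triples has reached the threshold.
        if len(self.memory) >= self.threshold:
            self._spill()
        # Step 5: the caller simply continues adding triples.

    def touch(self, triple):
        """Mark a triple as recently used so that LRU evicts it later."""
        if triple in self.memory:
            self.memory.move_to_end(triple)

    def _spill(self):
        # Step 3: the OrderedDict is already ordered by the chosen policy (LRU).
        # Step 4: write triples out while more than x * threshold remain in memory.
        target = int(self.keep_fraction * self.threshold)
        while len(self.memory) > target:
            victim, _ = self.memory.popitem(last=False)
            self.disk.add(victim)


store = SpillingTripleStore(threshold=4, keep_fraction=0.5)
for i in range(5):
    store.add(("ex:s%d" % i, "ex:p", "ex:o"))
print(len(store.memory), len(store.disk))  # prints: 3 2
```

Swapping LRU for another policy only changes how the buffer is ordered before step 4; the threshold logic stays the same.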
[Diagram components: In-memory triple store; Lucene triple store]
1. Query model
2. If result not in memory, query Lucene triple store
3. Return result
4. Return result
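The query path in this diagram is a plain two-level lookup. The sketch below assumes triples are grouped by subject, so that a subject's triples are either all in memory or all on disk; the function name `query_subject` and the two dictionaries are hypothetical stand-ins for the in-memory model and the Lucene triple store.

```python
def query_subject(subject, memory, disk):
    """Return (triples, source) for a subject, trying memory before disk."""
    # Step 1: query the in-memory model.
    if subject in memory:
        return memory[subject], "memory"
    # Step 2: result not in memory, so query the disk-backed store.
    result = disk.get(subject, set())
    # Steps 3 and 4: hand the result back to the caller.
    return result, "disk"


memory = {"ex:alice": {("ex:alice", "ex:knows", "ex:bob")}}
disk = {"ex:carol": {("ex:carol", "ex:knows", "ex:alice")}}

print(query_subject("ex:alice", memory, disk)[1])  # prints: memory
print(query_subject("ex:carol", memory, disk)[1])  # prints: disk
```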
Choice of Algorithm
Memory management algorithms such as LRU, MRU, FIFO, and LIFO
Social network analysis measures such as degree centrality and individual clustering coefficient
Combination of memory management algorithm with degree centrality and individual clustering coefficient
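For reference, the two social network analysis measures named here can be computed as follows. This sketch uses the standard textbook definitions and treats the RDF graph as undirected, ignoring predicates; both simplifications are assumptions. The slides do not say whether high-scoring or low-scoring nodes are written to disk first, so the sketch only computes the scores.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbors(triples):
    """Undirected adjacency between subjects and objects; predicates are ignored."""
    nbrs = defaultdict(set)
    for s, _p, o in triples:
        if s != o:
            nbrs[s].add(o)
            nbrs[o].add(s)
    return dict(nbrs)


def degree_centrality(nbrs, node):
    """Degree of the node divided by the maximum possible degree (n - 1)."""
    n = len(nbrs)
    return len(nbrs.get(node, ())) / (n - 1) if n > 1 else 0.0


def clustering_coefficient(nbrs, node):
    """Fraction of pairs of the node's neighbours that are themselves linked."""
    neighbours = nbrs.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in nbrs[a])
    return 2.0 * links / (k * (k - 1))


triples = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:p", "ex:c"),
    ("ex:b", "ex:p", "ex:c"),
    ("ex:a", "ex:p", "ex:d"),
]
nbrs = build_neighbors(triples)
print(degree_centrality(nbrs, "ex:a"))       # prints: 1.0
print(clustering_coefficient(nbrs, "ex:b"))  # prints: 1.0
```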
Choice of buffer and persistence strategy
Buffer can be created based on the subject, predicate, object or a combination of them
Map Jena’s subject, predicate and object indexes to Lucene indexes directly
Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation
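The difference between mapping all three indexes up front and creating indexes as needed can be illustrated with a small lazy-index sketch. It does not use the Lucene API; the class `LazyIndexes` and its dictionary-based indexes are hypothetical stand-ins that only show the "build on first use" idea.

```python
from collections import defaultdict


class LazyIndexes:
    """Builds a per-position index over persisted triples only on first use."""

    POSITIONS = {"subject": 0, "predicate": 1, "object": 2}

    def __init__(self, triples):
        self.triples = list(triples)
        self.indexes = {}  # position name -> {value: [triples]}

    def lookup(self, position, value):
        if position not in self.indexes:
            # First query on this position: build its index now.
            slot = self.POSITIONS[position]
            index = defaultdict(list)
            for triple in self.triples:
                index[triple[slot]].append(triple)
            self.indexes[position] = index
        return self.indexes[position].get(value, [])


persisted = LazyIndexes([
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:age", "30"),
])
print(len(persisted.lookup("predicate", "ex:knows")))  # prints: 2
print(sorted(persisted.indexes))                       # prints: ['predicate']
```

An eager variant would build all three indexes in the constructor, paying the full indexing cost at write time regardless of which positions the queries actually use.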
Conclusions from the in-memory model
Degree centrality is the best algorithm to choose a node to be persisted to disk
Creating Lucene indexes as needed is a better choice for the persistence strategy than creating all indexes at the same time
Problems with RDB Jena model
The RDB Jena model can add any number of triples to the relational database
When a query asking for a large number of triples is executed, the returned result set fills up memory, causing the program to crash with an out-of-memory exception
We want to solve this out-of-memory problem
We leverage the previous in-memory extension to solve this problem
Memory management algorithm
Algorithm
We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
The retrieved triples are added to the extended in-memory Jena model
Thus we use the memory management algorithm from the in-memory model
Since the revised in-memory model never runs out of memory, this RDB solution never runs out of memory
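The LIMIT/OFFSET paging loop can be sketched with Python's built-in `sqlite3` module. The table `triples` with columns `s`, `p`, `o` is a hypothetical schema chosen for illustration, not Jena's RDB layout, and the `add_triple` callback stands in for adding to the extended in-memory model.

```python
import sqlite3


def load_in_pages(conn, page_size, add_triple):
    """Feed every row of the triples table to add_triple, one page at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            add_triple(row)  # the bounded model decides what stays in memory
        offset += len(rows)
    return offset


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("ex:s%d" % i, "ex:p", "ex:o") for i in range(10)],
)

model = []  # stand-in for the extended in-memory model
total = load_in_pages(conn, page_size=3, add_triple=model.append)
print(total)  # prints: 10
```

At most `page_size` rows of the result set are held by the loop at any time; the stable `ORDER BY` keeps pages from overlapping or skipping rows.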
Conclusions
Conclusions from the extended RDB model
  Model creation times are similar to the original RDB Jena model
  Query times vary based on the threshold value in the in-memory solution
General conclusions
  Implemented an in-memory, cache-based memory management algorithm
  Solves the memory problem for the in-memory and RDB Jena models by creating an impression of infinite memory for the user
  Moves the memory problem to disk space
Problems with SDB Jena Model
The SDB Jena model can add any number of triples to the relational database
When a query asking for a large number of triples is executed, the returned result set fills up memory, causing the program to crash with an out-of-memory exception
We want to solve this out-of-memory problem
The SDB solution does not depend on the in-memory or RDB extensions
Memory management algorithm
Algorithm
We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
The retrieved triples are returned as a separate iterator to the executing program
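Returning the paged results as an iterator maps naturally onto a generator. As before, this is a sketch under assumptions: the `triples` table and its `s`, `p`, `o` columns are hypothetical, and SQLite stands in for the database behind SDB.

```python
import sqlite3
from itertools import islice


def iter_triples(conn, page_size):
    """Yield rows lazily, fetching the next page only when the caller needs it."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return
        yield from page
        offset += len(page)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("ex:s%d" % i, "ex:p", "ex:o") for i in range(7)],
)

# The caller consumes only what it needs; unread pages are never fetched.
first_five = list(islice(iter_triples(conn, page_size=3), 5))
print(len(first_five))  # prints: 5
```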
Inferencing in Semantic Web
Ontology specification - TBox
Instance creation - ABox
Inference - Generating new triples based on instances in the ABox backed by the TBox
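A minimal example of generating new triples from the ABox backed by the TBox is subclass propagation (the RDFS rule that an instance of a class is also an instance of its superclasses). The sketch below covers only that one rule with plain string triples; a full reasoner handles far more, and the function name `infer_types` is hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(tbox, abox):
    """Return rdf:type triples entailed by subclass axioms but absent from the ABox."""
    abox = set(abox)
    # Collect the direct superclasses declared in the TBox.
    supers = {}
    for s, p, o in tbox:
        if p == SUBCLASS_OF:
            supers.setdefault(s, set()).add(o)

    inferred = set()
    for s, p, o in abox:
        if p != RDF_TYPE:
            continue
        # Walk up the class hierarchy from the asserted class.
        stack = list(supers.get(o, ()))
        seen = set()
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            triple = (s, RDF_TYPE, cls)
            if triple not in abox:
                inferred.add(triple)
            stack.extend(supers.get(cls, ()))
    return inferred


tbox = [
    ("ex:Student", SUBCLASS_OF, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Agent"),
]
abox = [("ex:alice", RDF_TYPE, "ex:Student")]

for triple in sorted(infer_types(tbox, abox)):
    print(triple)
```

Here the two inferred triples state that `ex:alice` is also an `ex:Person` and an `ex:Agent`.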
Problems in inferencing with this extension
How do you do reasoning when the graph is divided between memory and disk?
Scalability
Buffer Management Strategy
[Diagram components: Triple store; In-memory triple store; In-memory triple store + buffer; Lucene triple store]
Add triples
Is triple a part of TBox? (Yes / No)
1. Added triples = Threshold
2. Buffer sorted based on memory management algorithm
3. Write triples based on sorted buffer while triples left > x of Threshold
Continue adding triples
[Diagram components: Pellet Reasoner; In-memory triple store (shown twice); Lucene triple store]
1. Query
2. Get TBox triples
3. Query for ABox triples
4. If result not in memory, query Lucene triple store
5. Return result
6. Return result
7. Return result
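One way to read the two inference diagrams is that TBox triples stay in memory while only ABox triples pass through the spilling buffer. The flattened diagram does not show which branch of the TBox check leads where, so that routing is an assumption, as are the `is_tbox` heuristic, the FIFO eviction, the fixed "spill the oldest half" rule, and the class name `InferenceStore`. No reasoner is invoked here; the sketch only gathers the triples that would be handed to one.

```python
TBOX_PREDICATES = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}
SCHEMA_TYPES = {"owl:Class", "rdfs:Class", "owl:ObjectProperty", "owl:DatatypeProperty"}


def is_tbox(triple):
    """Heuristic (assumed): schema-level predicates and class/property declarations."""
    _s, p, o = triple
    return p in TBOX_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)


class InferenceStore:
    def __init__(self, threshold):
        self.threshold = threshold
        self.tbox = set()  # assumed to stay in memory at all times
        self.abox = []     # in-memory ABox triples, oldest first (FIFO)
        self.disk = set()  # hypothetical stand-in for the Lucene triple store

    def add(self, triple):
        if is_tbox(triple):
            self.tbox.add(triple)
            return
        self.abox.append(triple)
        if len(self.abox) >= self.threshold:
            # Spill the oldest half of the ABox buffer to disk.
            cut = len(self.abox) // 2
            self.disk.update(self.abox[:cut])
            self.abox = self.abox[cut:]

    def triples_for_reasoner(self, subject):
        """TBox plus the ABox triples about a subject, from memory or else disk."""
        hits = {t for t in self.abox if t[0] == subject}
        if not hits:
            # Result not in memory: fall back to the disk-backed store.
            hits = {t for t in self.disk if t[0] == subject}
        return self.tbox | hits


store = InferenceStore(threshold=4)
store.add(("ex:Student", "rdfs:subClassOf", "ex:Person"))
for i in range(4):
    store.add(("ex:a%d" % i, "rdf:type", "ex:Student"))
print(len(store.tbox), len(store.abox), len(store.disk))  # prints: 1 2 2
```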
Choice of Algorithm
Memory management algorithms such as LRU, MRU, FIFO, and LIFO
Social network analysis measures such as degree centrality and individual clustering coefficient
Combination of memory management algorithm with degree centrality and individual clustering coefficient
Choice of buffer and persistence strategy
Buffer can be created based on the subject, predicate, object or a combination of them
Map Jena’s subject, predicate and object indexes to Lucene indexes directly
Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation
Conclusions from the inference model
RANDOM is the best algorithm to choose a node to be persisted to disk
Creating all Lucene indexes at the same time is a better choice for the persistence strategy than creating the indexes one at a time
Future Work
Test all models with benchmark data
Generalize the algorithm to be able to handle multiple incarnations of nodes over time
Improve the efficiency of all algorithms
Try other algorithms for selecting the candidate node to be written to disk