Managing Large RDF Graphs

(Infinite Graph)

Vaibhav Khadilkar

Department of Computer Science,

The University of Texas at Dallas

FEARLESS engineering

Agenda

- Motivation behind the project
- Semantic web technologies overview
- Proposed architecture
- Performance metrics

Motivation - Current Problems

- Jena’s in-memory model does not scale
- Jena’s RDB and SDB models cannot handle large result sets
- This hinders reasoning and large-graph processing
- Current work focuses on load balancing and fault tolerance
- Current systems can be broken with even 100,000 triples
- We work on load balancing and polynomial reasoning, but memory management breaks systems before any other problems can be addressed

Motivation - Relevance of the problem

- This is an unsolved problem
- Critical for handling the terabytes of data that are common today
- Move the problem from memory space to disk space

[Diagram: Jena and its components (In-memory, RDB, SDB, ARQ), together with the Extension and Reasoning]

Semantic web technologies overview - Jena

- Jena is a Java-based framework for building Semantic Web applications
- Jena provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine
- Jena allows the creation and manipulation of RDF graphs that are either in-memory or backed by a relational database (RDB and SDB)

Semantic web technologies overview - Lucene

- Lucene is a Java-based text indexing and searching tool
- The smallest unit of text that Lucene indexes and searches is a Document
- A Document contains different fields and a corresponding value for each field
- The different fields are the indexes that can be used as keywords during a search
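
To make the Document and field idea concrete, here is a minimal sketch in plain Python. It is not Lucene's API. It assumes that each RDF triple is stored as one document with the hypothetical fields `subject`, `predicate` and `object`, and that the index maps a (field, value) pair to the documents that contain it. The class name `TinyIndex` and its methods are illustrative only.

```python
from collections import defaultdict


class TinyIndex:
    """Toy stand-in for a field-based index (an assumption, not the Lucene API)."""

    def __init__(self):
        self.docs = []                    # document id -> {field: value}
        self.postings = defaultdict(set)  # (field, value) -> set of document ids

    def add_document(self, fields):
        """Store one document and index every one of its fields."""
        doc_id = len(self.docs)
        self.docs.append(dict(fields))
        for field, value in fields.items():
            self.postings[(field, value)].add(doc_id)
        return doc_id

    def search(self, field, value):
        """Return all documents whose `field` equals `value`."""
        doc_ids = self.postings.get((field, value), ())
        return [self.docs[i] for i in sorted(doc_ids)]


index = TinyIndex()
index.add_document({"subject": "ex:alice", "predicate": "ex:knows", "object": "ex:bob"})
index.add_document({"subject": "ex:bob", "predicate": "ex:knows", "object": "ex:carol"})

# Any field can serve as the search key.
hits = index.search("subject", "ex:bob")
```

Because every field is indexed, the same store can answer lookups by subject, by predicate or by object, which is the property the later slides rely on when mapping Jena's indexes to Lucene.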

Problems with In-memory Jena Model

- Handles medium-sized graphs
- As nodes are added, memory fills up
- As more nodes are added, the program crashes with an out-of-memory exception
- We want to solve this out-of-memory problem

Buffer Management Strategy

[Diagram: an in-memory triple store with a buffer, backed by a Lucene triple store]

1. Add triples
2. Added triples = Threshold
3. Buffer sorted based on memory management algorithm
4. Write triples based on sorted buffer while triples left > x of Threshold
5. Continue adding triples
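
The five steps of the buffer management strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the presented implementation: a Python `set` stands in for the Lucene triple store, `keep_fraction` is a hypothetical name for the "x" in "x of Threshold", and the default ordering is FIFO (oldest triple written first). The class and parameter names are invented for the example.

```python
class BufferedTripleStore:
    """Toy in-memory triple store that spills to a disk-like store at a threshold."""

    def __init__(self, threshold, keep_fraction, disk, sort_key=None):
        self.threshold = threshold          # number of triples that triggers a spill
        self.keep_fraction = keep_fraction  # assumed meaning of "x of Threshold"
        self.disk = disk                    # stand-in for the Lucene triple store
        # Default policy is FIFO: order entries by their insertion sequence number.
        self.sort_key = sort_key or (lambda entry: entry[0])
        self.memory = []                    # list of (sequence number, triple)
        self._seq = 0

    def add(self, triple):
        # Step 1: add the triple to the in-memory store.
        self.memory.append((self._seq, triple))
        self._seq += 1
        # Step 2: check whether the number of added triples reached the threshold.
        if len(self.memory) >= self.threshold:
            self._spill()
        # Step 5: the caller simply continues adding triples.

    def _spill(self):
        # Step 3: sort the buffer according to the memory management policy.
        self.memory.sort(key=self.sort_key)
        # Step 4: write triples out while more than x of the threshold remain.
        keep = int(self.keep_fraction * self.threshold)
        while len(self.memory) > keep:
            _seq, triple = self.memory.pop(0)
            self.disk.add(triple)


disk = set()
store = BufferedTripleStore(threshold=4, keep_fraction=0.5, disk=disk)
for i in range(5):
    store.add((f"ex:s{i}", "ex:p", f"ex:o{i}"))

in_memory = [triple for _seq, triple in store.memory]
```

Swapping the `sort_key` function is how a different policy (for example one based on a node's degree) would plug into the same loop.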

[Diagram: query flow between the in-memory triple store and the Lucene triple store]

1. Query model
2. If result not in memory, query Lucene triple store
3. Return result
4. Return result
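
The query flow above can be sketched as a two-level lookup. This is a simplified illustration: both stores are plain Python lists, a pattern is a hypothetical `(subject, predicate, object)` tuple in which `None` acts as a wildcard, and the fallback is taken literally from the diagram (the disk store is consulted only when memory yields nothing). The slides do not say how results split across both stores are merged, so that case is not handled here.

```python
def matches(triple, pattern):
    """True if every non-wildcard position of the pattern equals the triple's value."""
    return all(want is None or want == have for have, want in zip(triple, pattern))


def query(pattern, memory, disk):
    # Step 1: query the in-memory model first.
    result = [t for t in memory if matches(t, pattern)]
    if not result:
        # Step 2: nothing found in memory, so query the disk-backed store.
        # Step 3: the disk-backed store returns its matches.
        result = [t for t in disk if matches(t, pattern)]
    # Step 4: return the result to the caller.
    return result


memory = [("ex:alice", "ex:knows", "ex:bob")]
disk = [("ex:carol", "ex:knows", "ex:dave")]

from_memory = query(("ex:alice", None, None), memory, disk)
from_disk = query(("ex:carol", None, None), memory, disk)
missing = query(("ex:zed", None, None), memory, disk)
```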

Choice of Algorithm

- Memory management algorithms such as LRU, MRU, FIFO, and LIFO
- Social network analysis measures such as degree centrality and individual clustering coefficient
- Combination of memory management algorithm with degree centrality and individual clustering coefficient
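
For readers unfamiliar with the two social network measures, the sketch below computes them with their standard definitions. It assumes that each triple is treated as an undirected edge between its subject and object, which is a simplification chosen for the example; the slides do not state how the graph was built or whether high or low scores are persisted first. Degree centrality is normalized as degree / (n - 1), and the clustering coefficient of a node with k neighbours is 2e / (k(k - 1)), where e is the number of links among those neighbours.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbors(triples):
    """Treat every triple as an undirected edge between subject and object."""
    neighbors = defaultdict(set)
    for s, _p, o in triples:
        if s != o:
            neighbors[s].add(o)
            neighbors[o].add(s)
    return dict(neighbors)


def degree_centrality(neighbors):
    """Normalized degree: the fraction of other nodes each node is connected to."""
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def clustering_coefficient(neighbors, node):
    """Fraction of pairs of a node's neighbours that are themselves connected."""
    adj = neighbors.get(node, set())
    k = len(adj)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj, 2) if b in neighbors[a])
    return 2 * links / (k * (k - 1))


triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:knows", "ex:d"),
]
graph = build_neighbors(triples)
centrality = degree_centrality(graph)
cc_a = clustering_coefficient(graph, "ex:a")
cc_b = clustering_coefficient(graph, "ex:b")
```

Either score could serve as the sort key when deciding which nodes to write to disk.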

Choice of buffer and persistence strategy

- Buffer can be created based on the subject, predicate, object or a combination of them
- Map Jena’s subject, predicate and object indexes to Lucene indexes directly
- Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation

Conclusions from the in-memory model

- Degree centrality is the best algorithm to choose a node to be persisted to disk
- Creating Lucene indexes as needed is a better choice for the persistence strategy than creating all indexes at the same time

Problems with RDB Jena model

- The RDB Jena model can add any number of triples to the relational database
- When a query asking for a large number of triples is executed, the result set returned fills up memory, causing the program to crash with an out-of-memory exception
- We want to solve this out-of-memory problem
- We leverage the previous in-memory extension to solve this problem

Memory management algorithm

Algorithm
- We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
- The retrieved triples are added to the extended in-memory Jena model
- Thus we use the memory management algorithm from the in-memory model
- Since the revised in-memory model never runs out of memory, this RDB solution never runs out of memory
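
A minimal sketch of the paging idea, using Python's built-in `sqlite3` module as a stand-in database. The table name `triples`, its columns and the `add_triple` callback are hypothetical and do not reflect Jena's RDB schema; the callback stands for adding a triple to the extended in-memory model. The `ORDER BY` clause is an addition of this sketch, since paging with LIMIT and OFFSET is only deterministic over a stable ordering.

```python
import sqlite3


def load_in_pages(conn, page_size, add_triple):
    """Fetch the result set one page at a time and hand each triple to the model."""
    offset = 0
    total = 0
    while True:
        rows = conn.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break  # no more results
        for row in rows:
            add_triple(tuple(row))
        total += len(rows)
        offset += page_size
    return total


# Build a small in-memory database to page through.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(10)],
)

model = []  # stand-in for the extended in-memory model
total = load_in_pages(conn, page_size=3, add_triple=model.append)
```

At most `page_size` rows are held by the fetch loop at any time, so memory use is bounded by the page size plus whatever the receiving model keeps.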

Conclusions

Conclusions from the extended RDB model
- Model creation times are similar to the original RDB Jena model
- Query times vary based on the threshold value in the in-memory solution

General conclusions
- Implemented an in-memory cache based memory management algorithm
- Solves the memory problem for the in-memory and RDB Jena models by creating an impression of infinite memory for the user
- Moves the memory problem to disk space

Problems with SDB Jena Model

- The SDB Jena model can add any number of triples to the relational database
- When a query asking for a large number of triples is executed, the result set returned fills up memory, causing the program to crash with an out-of-memory exception
- We want to solve this out-of-memory problem
- The SDB solution does not depend on the in-memory or RDB extensions

Memory management algorithm

Algorithm
- We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
- The retrieved triples are returned as a separate iterator to the executing program
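
The iterator variant can be sketched with a Python generator over `sqlite3`. As before, the table `triples` and its columns are hypothetical and are not SDB's actual layout, and the `ORDER BY` clause is added here to keep the paging stable. The point of the sketch is only that the caller sees one continuous iterator while the database is queried a page at a time.

```python
import sqlite3
from itertools import islice


def iter_triples(conn, page_size):
    """Lazily yield triples, fetching one LIMIT/OFFSET page per database round trip."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return  # result set exhausted
        yield from rows
        offset += page_size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(7)],
)

# The caller can stop early; pages beyond the ones consumed are never fetched.
first_three = list(islice(iter_triples(conn, page_size=2), 3))
everything = list(iter_triples(conn, page_size=2))
```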

Inferencing in Semantic Web

- Ontology specification - TBox
- Instance creation - ABox
- Inference - Generating new triples based on instances in the ABox backed by the TBox
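
As a small worked example of generating new triples from the ABox with the help of the TBox, the sketch below applies a single RDFS rule: if an instance has type C and C is a subclass of D, then the instance also has type D. This is only one rule chosen for illustration and is far from a full OWL reasoner; the prefixed names such as `ex:Student` are made up for the example.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer_types(tbox, abox):
    """Return the rdf:type triples entailed by subclass axioms but absent from the ABox."""
    # Collect the direct superclasses declared in the TBox.
    superclasses = {}
    for sub, pred, sup in tbox:
        if pred == SUBCLASS:
            superclasses.setdefault(sub, set()).add(sup)

    known = set(abox)
    inferred = set()
    frontier = [t for t in abox if t[1] == TYPE]
    # Propagate types upward until no new triple appears (handles subclass chains).
    while frontier:
        instance, _pred, cls = frontier.pop()
        for sup in superclasses.get(cls, ()):
            new_triple = (instance, TYPE, sup)
            if new_triple not in known:
                known.add(new_triple)
                inferred.add(new_triple)
                frontier.append(new_triple)
    return inferred


tbox = [
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
]
abox = [("ex:alice", TYPE, "ex:Student")]

new_triples = infer_types(tbox, abox)
```

Note that the rule needs the whole TBox but touches ABox triples one at a time, which is one reason it helps to keep the TBox readily available while the ABox is spread across memory and disk.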

Problems in inferencing with this extension

- How do you do reasoning when the graph is divided between memory and disk?
- Scalability

Buffer Management Strategy

[Diagram: each added triple passes through the decision "Is triple a part of TBox?" (Yes / No). Components shown: Triple store, In-memory triple store, In-memory triple store + buffer, Lucene triple store]

Add triples
1. Added triples = Threshold
2. Buffer sorted based on memory management algorithm
3. Write triples based on sorted buffer while triples left > x of Threshold
Continue adding triples
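
The TBox check in this diagram can be sketched as a routing step placed in front of the buffer. Two things here are assumptions rather than details from the slides: that TBox triples stay in a dedicated in-memory store while all other triples go to the buffered store, and the heuristic used to recognise a TBox triple (a schema predicate, or an `rdf:type` statement whose object is a class or property type). The constant and function names are hypothetical.

```python
# Assumed heuristic for recognising schema-level (TBox) triples.
SCHEMA_PREDICATES = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}
SCHEMA_TYPES = {"owl:Class", "rdfs:Class", "owl:ObjectProperty", "owl:DatatypeProperty"}


def is_tbox(triple):
    """Answer the diagram's question "Is triple a part of TBox?" (heuristically)."""
    _s, p, o = triple
    return p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)


def route(triples):
    """Split incoming triples into a TBox store and a buffered ABox store."""
    tbox_store, abox_buffer = [], []
    for triple in triples:
        if is_tbox(triple):
            tbox_store.append(triple)   # assumed: kept in memory for the reasoner
        else:
            abox_buffer.append(triple)  # assumed: subject to threshold-based spilling
    return tbox_store, abox_buffer


incoming = [
    ("ex:Student", "rdf:type", "owl:Class"),
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:knows", "ex:bob"),
]
tbox_store, abox_buffer = route(incoming)
```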

[Diagram: query flow through the Pellet Reasoner, the in-memory triple stores and the Lucene triple store]

1. Query
2. Get TBox triples
3. Query for ABox triples
4. If result not in memory, query Lucene triple store
5. Return result
6. Return result
7. Return result

Choice of Algorithm

- Memory management algorithms such as LRU, MRU, FIFO, and LIFO
- Social network analysis measures such as degree centrality and individual clustering coefficient
- Combination of memory management algorithm with degree centrality and individual clustering coefficient

Choice of buffer and persistence strategy

- Buffer can be created based on the subject, predicate, object or a combination of them
- Map Jena’s subject, predicate and object indexes to Lucene indexes directly
- Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation

Conclusions from the inference model

- RANDOM is the best algorithm to choose a node to be persisted to disk
- Creating all Lucene indexes at the same time is a better choice for the persistence strategy than creating the indexes one at a time

Future Work

- Test all models with benchmark data
- Generalize the algorithm to be able to handle multiple incarnations of nodes over time
- Improve the efficiency of all algorithms
- Try other algorithms for selecting the candidate node to be written to disk
