8/3/2019 Rdf, Sw, Sparql Final
RDF TRIPLE STORES,
SPARQL AND THE SEMANTIC
WEB.
Muntazir Mehdi
Department of Computer Science
Technical University Kaiserslautern
67653, Kaiserslautern, Germany
Abstract. Current research in the area of the World Wide Web has mainly
focused on the advent of a technology which enables machines to understand
data. This results in a whole new type of Web which contains meaningful
metadata in addition to the linked documents and their relationships, enabling
roaming agents to extract useful information with an automated process. This
new type of web is named the Semantic Web. For the sake of unified standards, all
developments in the area of the Semantic Web are handled by the World Wide Web
Consortium (W3C). This paper briefly discusses some of the already existing
standards: first we provide a brief introduction to the Resource Description
Framework (RDF), to SPARQL as query support for RDF, and to their syntax; then we
look into data management techniques for RDF triples; and finally we conclude
the paper by summarizing its individual parts.
Keywords: Semantic Web, Resource Description Framework (RDF), SPARQL,
RDF triple stores, RDF Data Management.
1 Introduction

Today the Web focuses only on the syntactic representation of information.
This information is nothing more than a network of documents linked together in
the form of web pages. It is understandable to humans, and that in itself is its
biggest drawback. Keeping in mind the drastic advancement in the field of the
internet, one can clearly see that the internet, which was once created as a
communication infrastructure to facilitate communication between parties, has
evolved into an information infrastructure from which we expect to extract
information. This itself raises some very important questions that must be
addressed, e.g.: What is the proper
mailto:[email protected]:[email protected]:[email protected] -
8/3/2019 Rdf, Sw, Sparql Final
2/18
2 Muntazir Mehdi
information? Where is the proper information? And when is the proper information
needed? The word semantics literally translates to "meaning of a linguistic term".
The Semantic Web, basically, is a web of content where web pages are linked via
semantic relations among them, thus helping machines, in addition to humans, to
process the information; this is the most important improvement, as noted in many
writings.
The Semantic Web will bring structure to the meaningful content of web pages,
creating an environment where software agents roaming from page to page can
readily carry out sophisticated tasks for users. [1]
Now that we have some idea about the traditional web and the Semantic Web, we
can easily answer the question: Why do we need semantics? But this is not enough,
because the basis of the Semantic Web is still the old traditional web; therefore
the issue of representing information still exists. Knowledge/information
representation is the most important part of the Semantic Web. Unlike the
traditional web, where information was represented with the help of HTML and
other scripting languages that captured only the syntactic features of
information, the Semantic Web demands a language capable of incorporating
semantic features so that information can be inferred from it. The Resource
Description Framework (RDF) is one of the most popular Semantic Web languages
and derives its roots from XML. XML itself is powerful enough to be used for
information representation, but a single domain of information or knowledge can
be encoded in XML in multiple styles, and such multiplicity leads to considerable
complexity when a wide range of unknown communication participants is involved.
The Resource Description
Framework (RDF) is a framework and standard data interchange model specified by
the World Wide Web Consortium (W3C) for modeling and representing information [2].
The basics of the Resource Description Framework focus on the statement "from
machine readable to machine understandable" and comprise two important parts,
the RDF Model and the RDF Syntax, which are further discussed in Section 2.1.
Figure 1 shows the famous Semantic Web layer cake, in which the Resource
Description Framework (RDF) can be seen as a significant block.
When thinking about the Semantic Web, one also has to consider data storage and
management. The data management side of the Semantic Web has not been a popular
topic among researchers, but now that the area has matured over time, many ideas
have been proposed for storing and managing the Semantic Web data model, i.e. the
Resource Description Framework (RDF) [2]. The diverse data models of the Semantic
Web demand an entirely new way of storing data. In RDF, information is captured
in the form of statements, and those statements are represented as (subject,
predicate, object) or (subject, property, value) triples. For example, the simple
statement "Technical University Kaiserslautern is located in Kaiserslautern,
Germany" can be represented using the directed graph in Figure 2, or as the
triple (subject: Technical University Kaiserslautern, predicate: isLocatedIn,
object: Kaiserslautern). More triples about the same resource/subject, i.e.
Technical University Kaiserslautern in our example, can be created, resulting in
a complete set of information. The information is first broken into statements
and then translated into triples, and these triples can then be stored in
different ways. In this paper,
in Section 3 we will discuss some of the known data management techniques
including their architecture, effects of querying them and their performance.
Fig. 1. Semantic Web layer cake
Fig. 2. Directed Graph for RDF statement.
Another significant block in the Semantic Web layer cake (Figure 1) is the
Rules/Query block, which has its own importance. Once the information
representation and data management steps are completed, it is still a cumbersome
task to fetch exactly the information we need. There are multiple ways to query
RDF; one well-known query language for the Resource Description Framework (RDF)
is SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language.
SPARQL is not the only available query support for RDF; many other flavors also
exist, e.g. RQL, SeRQL, TRIPLE, RDQL, N3 and Versa, and many data management
engines and stores have their own query support. SPARQL is considered a key
Semantic Web technology and is a W3C Recommendation because of its capability to
query a variety of data sources, whether the data is stored natively as RDF or
viewed as RDF with the help of additional middleware [3]. The detailed syntax
and structure of both RDF and SPARQL are explained in Sections 2.1
and 2.2 respectively. SPARQL is also briefly discussed in Section 3 where we talk
about the management of RDF data.
2 RDF & SPARQL: Concepts, Syntax and Structure

2.1 RDF
We have had a brief introduction to the Resource Description Framework (RDF)
above. In the following sections we will have a detailed look at RDF. Since RDF
is composed of two important parts, the RDF Model and the RDF Syntax, let us
explain both in a little more detail.
RDF Basic Concepts
For the sake of understanding, let us once again consider a simple example where
we try to state some information about something:

TU Kaiserslautern is located in Kaiserslautern

For human understanding, this statement about TU Kaiserslautern is simply
expressed in plain English. Looking at the statement, one can say that it can be
broken down into different parts, and to understand it each part of the
statement should be identified. In our example we see that the statement is
being made about TU Kaiserslautern, which is a university; it is located in some
place; and the place is Kaiserslautern. For the sake of identification, let us
reformat the statement and write it in other words, so that TU Kaiserslautern
can be easily and uniquely identified as a standalone entity, since there may be
a lot of universities located in Kaiserslautern.
http://www.uni-kl.de is located in Kaiserslautern
Now let us once again break down the information and see which blocks constitute
the statement:

- The statement is made about a thing, i.e. http://www.uni-kl.de
- The statement has a property concerned with the thing it describes, i.e. is located in.
- The property of the thing has a value, i.e. Kaiserslautern.
Since the statement has been broken down, each part can be individually
identified. In our case the thing/resource/subject is http://www.uni-kl.de, the
property/predicate attached to it is isLocatedIn, and the value/object for the
property is Kaiserslautern. More information about this resource can again be
stated using simple English sentences:

University of Kaiserslautern has a department of Computer Science
http://www.uni-kl.de was founded in July, 1970

Note that all the statements made above carry information about a single
subject, i.e. TU Kaiserslautern, but the problem is that the subject is
mentioned in three different ways. The main idea of RDF is to describe
resources, where resources have properties
and those properties have values. RDF uses a specific terminology for the parts
of a statement [2]: the part where the statement describes a resource is called
the subject, the part which states a property or characteristic of the subject
is called the property/predicate, and the part which gives the value of that
predicate or property is called the value/object. For example, in our statements:
Subject = TU Kaiserslautern / http://www.uni-kl.de / University of Kaiserslautern.
Predicate/Property = locatedIn / hasDepartment / foundedOn.
Object/Value = Kaiserslautern / Computer Science / July, 1970.
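The statements above can be written down directly as (subject, predicate, object) triples. As a quick illustration, here is a minimal Python sketch (the predicate names follow the examples in this section and are not prescribed by RDF itself):

```python
# Each RDF statement becomes one (subject, predicate, object) triple.
# The subject is the same URI in all three statements, which is exactly
# what lets the information merge into one description of the resource.
triples = [
    ("http://www.uni-kl.de", "isLocatedIn",   "Kaiserslautern"),
    ("http://www.uni-kl.de", "hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "foundedOn",     "July, 1970"),
]

# All statements about a single resource can be collected by filtering
# on the subject position.
about_uni = [(p, o) for (s, p, o) in triples if s == "http://www.uni-kl.de"]
print(about_uni)
```

Filtering on the subject position like this is the conceptual core of querying a triple store, as Section 2.2 will show.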
As we have so far been talking about human understanding, in order to make this
information processable by machines RDF requires the following two important
things [2]:

1. A language that is already processable by machines and can represent and
exchange these statements.
2. A system of identifying each part of a statement without any ambiguity,
relating it to resources available on the web, that is also machine processable.
The World Wide Web already has two solid identification mechanisms that are
machine processable: the Uniform Resource Locator (URL), which specifies the
location of a resource, and the Uniform Resource Identifier (URI), which is a
superset of URLs and can be created independently by any organization or person.
When we look at our example, we luckily have a resource with a URL identifier,
i.e. http://www.uni-kl.de, but what about resources which have no web location
or URL, e.g. a credit card, a human being, a telephone bill? URIs have the
capability to identify resources which are 1) on the web, 2) not on the web and
3) abstract concepts.
RDF information can easily be written by anyone, independently, using XML [2].
RDF does not define a new representation language from scratch; since XML is
already a machine-processable and exchangeable format, RDF uses a variation of
XML, namely RDF/XML, which follows a simple syntax similar to XML. There is
another, relatively new serialization for RDF data named Turtle [15], which has
become popular especially for its closeness to the SPARQL query syntax, but in
this paper we will only discuss RDF/XML.
RDF Model
RDF data can be represented in the form of triples which follow a certain pattern or
can be represented in the form of a directed graph.
Now that we know that a statement can be broken down into parts and these parts
can be identified using URIs, we can use RDF triples to represent the information. An
example of such triples can be seen as follows (written here with full URIs; the
predicate URIs are illustrative):

<http://www.uni-kl.de> <http://www.abc.com/customType#isLocatedIn> "Kaiserslautern" .
<http://www.uni-kl.de> <http://www.abc.com/customType#hasDepartment> "Computer Science" .
<http://www.uni-kl.de> <http://www.abc.com/customType#foundedOn> "July, 1970" .
The graph representation of such information can be seen in the following figure:
The key representation notation is that resources are represented using oval
shapes, predicates are represented as directed graph edges, and literal values
are represented using rectangles. A single arc represents a single triple; a
triple consists of subject, predicate and object (where the object can itself be
a resource). The graph evolves because objects which are themselves resources
can have their own properties and values.
Another notable point in representing information is that URIs are mostly long
strings; therefore, to make representations compact and easily understandable,
RDF provides the use of prefixes. This substitution is made using XML namespace
references added at the beginning of the document: a fully qualified URI is
substituted by an XML prefix. A simple example:

Prefix: ct, namespace URI: http://www.abc.com/customType

Thus the predicates become:

ct:isLocatedIn, ct:hasDepartment, ct:foundedOn.
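The substitution works like ordinary namespace expansion: a prefixed name is rewritten to a full URI by concatenating the namespace URI and the local part. A minimal sketch in Python (the ct namespace is the illustrative one from the text; the choice of "#" as separator is an assumption, since vocabulary authors may also end the namespace in "/"):

```python
# Map each declared prefix to its namespace URI, as in an RDF/XML header.
prefixes = {"ct": "http://www.abc.com/customType"}

def expand(qname: str) -> str:
    """Expand a prefixed name like 'ct:isLocatedIn' to a full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + "#" + local

print(expand("ct:isLocatedIn"))
# http://www.abc.com/customType#isLocatedIn
```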
RDF Syntax
As discussed earlier, RDF uses an XML structure for the representation of
information; however, the flavor of XML that RDF uses is its own specification,
named RDF/XML [2].
To understand RDF, let us look at the code example given below, where the
following piece of information is represented using RDF/XML syntax:

Technical University Kaiserslautern is located in Kaiserslautern. The university
has a department of Computer Science. It was founded in July, 1970.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ct="http://www.abc.com/customType#">
  <rdf:Description rdf:about="http://www.uni-kl.de">
    <ct:isLocatedIn>Kaiserslautern</ct:isLocatedIn>
    <ct:hasDepartment>Computer Science</ct:hasDepartment>
    <ct:foundedOn>July, 1970</ct:foundedOn>
  </rdf:Description>
</rdf:RDF>

The very first line, i.e. <?xml version="1.0"?>, indicates that the content
following this line is XML. The piece of code that says <rdf:RDF ...> opens the
RDF content and declares the namespaces used in the rest of the document.
Properties with literal values are written as nested elements such as
<ct:isLocatedIn>Kaiserslautern</ct:isLocatedIn>, and those properties which have
another RDF resource as their value are represented in the form
<ct:isLocatedIn rdf:resource="..." />. Finally, </rdf:RDF> marks the end of the
RDF content.

A single RDF document may contain information about more than one resource,
each within its respective rdf:Description tags.
Above we presented the very basic syntax and structure using an example. A
detailed syntax and specification can be further seen in [2].
2.2 SPARQL

A lot of work has been done to develop a query language that fulfills the
requirements of all Semantic Web standards. The race has always been towards
creating a query language that is very similar to SQL yet has the potential to
deal with semantic data. Many query languages have been proposed in the
literature (all having their own pros and cons); SPARQL [3] has proved to be a
query language that is a center point for many researchers.

SPARQL does not merely emulate the SQL syntax; it also supports full pattern
matching, optional pattern matching, conjunction and disjunction. The one
shortcoming of SPARQL that can easily be observed is its inability to alter
stored RDF data.

As we already know, the basic idea of RDF is the representation of information
in the form of RDF triples consisting of subject, predicate and object; SPARQL
is no exception, as it is built on the same triple pattern.
SPARQL Syntax and Structure
In this Section we will have a look into the basic syntax and structure of SPARQL
query. For understanding, let us look at the query example given below:

PREFIX ct: <http://www.abc.com/customType#>
SELECT ?name
WHERE
{
  <http://www.uni-kl.de> ct:hasDepartment ?name .
}

Let us first see what happens when this query is executed on the RDF data we
have been using till now. The output of the query would be:

Computer Science
If we break down the query mentioned in the code segment above, we will be able
to explain each and every part in detail. The very first line in the code segment begins
with the keyword PREFIX. When we were dealing with RDF triples, we had to
identify each part of the statement using an identifier, but identifiers tend to be large
strings. Therefore, we used XML namespaces with prefixes through which we were
able to reduce long strings. The keyword mentioned here is equivalent to that. The
first and foremost part is to declare any prefixes. A single query can have multiple
prefixes used wherever necessary.
The second line of the code segment has the SELECT keyword. As already
discussed, SPARQL draws its roots from standard SQL, and SELECT marks the
beginning of the query. It has the same concept as in SQL: it names the
variables we want the query to return, and it binds those variables to the
output we expect to receive.

The FROM keyword, not needed in our example since we are using local RDF data,
once again works in a fashion similar to SQL: it identifies the dataset on which
we want our query to be executed, which can be a local file as well as a remote
one.
Finally, we have the WHERE keyword. We know both representations of RDF data,
i.e. triples and graphs; a match is made between the data and the triple (or
graph) pattern we specify in the braces after WHERE. A WHERE clause can have
more than one pattern specified in it, each terminated by a dot. The WHERE
keyword itself is optional, as in SQL, and can be omitted.
In the above example and explanation we have seen the very basic syntax and
structure of SPARQL; for a detailed specification and more complex queries with
additional query possibilities, see [3].
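Conceptually, evaluating such a basic query is pattern matching over the triples: a position holding a variable (a name starting with "?") matches anything and produces a binding, while a position holding a concrete term must match exactly. A toy illustration in Python (not a real SPARQL engine, just the matching idea, using the example data from this section):

```python
triples = [
    ("http://www.uni-kl.de", "ct:isLocatedIn",   "Kaiserslautern"),
    ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "ct:foundedOn",     "July, 1970"),
]

def match(pattern, store):
    """Return one binding dict per triple in the store matching the pattern."""
    results = []
    for triple in store:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val     # variable position: bind the value
            elif pat != val:
                break                  # concrete term: must match exactly
        else:
            results.append(binding)
    return results

# Equivalent of: SELECT ?name WHERE { <http://www.uni-kl.de> ct:hasDepartment ?name . }
print(match(("http://www.uni-kl.de", "ct:hasDepartment", "?name"), triples))
# [{'?name': 'Computer Science'}]
```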
3 Data Management

In this section we will discuss some data management techniques for the Resource
Description Framework (RDF): Sesame [4], a state-of-the-art relational DBMS data
storage solution for RDF; the Vertically Partitioned Approach [5, 6], a
performance-enhancing data model decomposition [9] technique for storing RDF
data; and RDF-3X [7, 8], an engine implementation which follows a RISC-style
architecture for achieving high performance with SPARQL queries on RDF data.
3.1 Sesame

Sesame is a standard framework for processing RDF data. It is an open-source
Java framework for storing, querying and reasoning about RDF and RDF Schema. It
can be used both as database storage and as a Java library for developing
applications that work with RDF and RDF Schema. The implementation of Sesame
follows a generic architecture, the Sesame Architecture [4], which is discussed
further below. The implementation has been designed carefully, with the
flexibility to support a variety of storage systems (relational databases,
in-memory, file systems), and
offers a wide range of tools to developers to utilize the power of RDF and
Semantic Web standards. Sesame also includes support for SPARQL over both local
and remote stores, accessed transparently through the same API.

A packaged product and source code for Sesame can be downloaded from
http://www.openrdf.org/download.jsp.
Sesame's Architecture Overview

The overall architecture of Sesame can be seen in Figure 3; the individual
components are further explained here.
In the Sesame architecture, RDF data is stored in a scalable repository; RDF can
be stored in various ways depending on the selection of repository, and a DBMS
suits this role very well. A wide range of DBMS systems is available, each with
its own features, strengths and areas of usefulness. Sesame is therefore
implemented in a DBMS-independent fashion: all code specific to a particular
DBMS is concentrated in a single architectural layer, the Storage And Inference
Layer (SAIL). This layer serves the main functional modules of Sesame: SAIL is
an API responsible for translating the RDF-specific requests made by the
functional modules into requests for the specific DBMS. The functional modules
are further discussed below.
Sesame is packaged in a manner that allows it to be deployed both as a web
application and as a web service. The packaged implementation is deployed in a
web container supporting Java servlets and can then be accessed via HTTP(S) or
SOAP. For scalability, a handler for each means of communication is added
separately; an additional protocol handler can be added to access Sesame via a
different means of communication. The request router is responsible for
receiving requests from the protocol handlers and routing them to the respective
functional modules, and vice versa.
Fig. 3. Sesame's Architecture
Sesame's Functional Modules

The Query Module:

The query module in Sesame's implementation uses RQL [10], but in a variant
which corresponds closely to the W3C recommendations, with support for making
domain and range restrictions both optional and multiple. This variant is also
called SeRQL (Sesame RDF Query Language). In this paper, however, we will not
assume any specific query language when explaining this module.

The path followed by this module while responding to a request is shown in
Figure 4. The module carries out two important functions on a query, parsing and
optimization: it initially parses the query and creates a query tree model, and
this tree is then forwarded to an optimizer, which creates an optimized version
of the query tree model. For example, a SPARQL query can be translated into an
SQL query, optimized with respect to the underlying DBMS, and then forwarded for
execution.
Fig. 4. The Query Module flow path
The Admin Module:

The two main functions of the admin module are to incrementally add RDF(S) data
into the repository and to clean up the repository. For populating the
repository with information extracted from RDF(S), a simple process is followed.
Generally the RDF(S) data is available online or locally in the form of
serialized XML (the extensions may vary; both .xml and .rdf(s) occur). Many
parsers are available to extract data from these serialized XML files, e.g. the
one in the Jena toolkit. The parser receives the XML file and, after parsing,
produces the data in the form of (subject, predicate, object) or (subject,
property, value) RDF triples. The admin module then communicates with SAIL and
inserts the data into the repository. Reporting errors and warnings is also the
responsibility of this module.
The RDF Export Module:

The simplest part of the Sesame architecture is the export module, which is only
responsible for exporting RDF(S) data. Schema information is useful for some
tools, RDF data for others, and in some scenarios both schema and data are
needed. Based on the request made, this module can export the schema, the data,
or both. After communicating with SAIL, the
module receives the triple data and produces a serialized, XML-formatted file.
This enables Sesame to be integrated with other RDF tools.
3.2 Vertically Partitioned Approach

The Vertically Partitioned Approach [5, 6] is an alternative to the property
table approach. In order to understand it, let us first have a basic
understanding of property tables.
Usually, RDF data is first parsed into (subject, predicate, object) or (subject,
property, value) triples and then fed into an RDBMS. Since many literals are
large strings on which pattern-based querying is inefficient, a mapping is
commonly created between long literals and an identifier table [13] to further
increase performance. A simple example of storing RDF data in one table can be
seen in Figure 5(a). The drawback of this simple and straightforward approach,
however, is the query processing time it takes to retrieve results from the
store. For this reason, the developers of the Jena Semantic Web toolkit proposed
in Jena2 [11, 12] the property table concept, which is considerably more
efficient for query processing. The proposal contains two types of property
table. The first type, known as the Clustered Property Table, groups clusters of
properties that are common to many subjects into one table; the remaining
triples are inserted into a conventional triple table. An example can be seen in
Figure 5(b). The second type, known as the Property-Class Table, uses the
property part of an RDF triple: it creates classes based on properties that are
common among subjects and groups those subjects into individual tables. Again,
the leftover triples are stored in a conventional triple table, as with the
Clustered Property Table. This technique of storing triples with respect to
their classes has been found useful in Jena2 and is also very effective for
storing reified statements. Reification in the Semantic Web is defined as a
statement about a statement; for example, one statement is "Earth revolves round
the Sun" and a statement that reifies it is "Scientists believe that Earth
revolves round the Sun". When storing reified statements, rdf:Statement is
considered the class and the properties are rdf:subject, rdf:predicate and
rdf:object. An example of a Property-Class Table is shown in Figure 5(c).
Now that we have seen the property table technique, let us look at the
alternative approach, which enhances query performance by using a fully
decomposed storage model [9]. The Vertically Partitioned Approach is very simple
and straightforward: a table is created for every unique property in the data.
All unique properties are extracted from the triples, and the triples are then
inserted into the respective tables. Each table consists of two columns: the
first column holds the subject and the second the property value. The most
interesting and performance-enhancing part of this approach is the sorting
applied to the subject column of the individual tables. This enables subjects to
be located quickly, so fast merge joins can be used to reconstruct the required
information about multiple subsets of subjects. An example of this approach is
shown in Figure 6.
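The approach can be sketched in a few lines: one two-column table per distinct property, each kept sorted on the subject column. A small Python illustration with toy data (subject and value strings are invented for the example):

```python
from collections import defaultdict

triples = [
    ("uni-kl", "locatedIn", "Kaiserslautern"),
    ("uni-kl", "hasDepartment", "Computer Science"),
    ("uni-kl", "hasDepartment", "Mathematics"),  # multi-valued property
    ("tu-munich", "locatedIn", "Munich"),         # no hasDepartment row needed
]

# One (subject, value) table per property, sorted on the subject column.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))
for rows in tables.values():
    rows.sort()

# Multi-valued attributes are just successive rows; a missing value for a
# subject simply has no row at all (no NULLs to store).
print(tables["hasDepartment"])
# [('uni-kl', 'Computer Science'), ('uni-kl', 'Mathematics')]
```

Because every table is sorted on its subject column, combining two property tables for the same subjects can be done with a linear merge join, which is the source of the performance gain described above.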
The Vertically Partitioned Approach has several advantages over the property
table technique. Some of them are listed below:

Support for multi-valued attributes: Subjects which have more than one value for
a particular property can easily be stored in the decomposed storage model; the
technique is simply to add the values in successive rows.

Support for heterogeneous records: This is the biggest advantage of the
Vertically Partitioned Approach over property tables. When dealing with
unstructured or poorly structured data, there is always the possibility of
property values missing for some subjects. The idea here is simply to omit them
while populating the table; in other words, NULL values need not be stored
anymore.
Fig. 5. RDF Triple data and property table examples
Certainly this approach has its own disadvantages in some scenarios; however, it
has been observed that, compared to the property table technique, it has the
upper hand. A detailed performance comparison of both approaches can be seen in
Section 6 of [5, 6].
Fig. 6. Vertically Partitioned Approach example
3.3 RDF-3X

RDF-3X, as the name suggests, is an engine implementation covering three salient
features: 1) a generic storage solution for RDF data, designed so that no
further tuning is required, 2) a query processor and 3) a query optimizer.
Storage and Indexing
Triples Store and Dictionary:
As discussed earlier, the current state-of-the-art schema for storing RDF data
is the property table, but this engine once again uses a simple approach in
which all RDF data is extracted in the form of (subject, predicate, object) or
(subject, property, value) triples. Once extracted, the triples are stored in a
repository; the repository is a custom storage implementation rather than an
RDBMS, which supports the RISC-style design principle. Section 3.2 mentioned the
costs of directly storing RDF triples in a single table; the engine
implementation counters the criticism that a single table incurs too many
self-joins by creating indexes which prove to be very efficient.
Once again, the notion of a dictionary, mapping large string literals to an
identifier table, is used as before. The cost of this is indexing the
dictionary, but it brings two benefits: 1) compression of the triple store and
2) simplification of the query processor. All triples are stored in a clustered
B+-tree, sorted lexicographically. This data structure helps convert SPARQL
patterns into range scans; another advantage is that when a specific pattern is
matched, the bindings for every unknown literal can be found in a single scan,
in logarithmic amortized time.
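The dictionary idea amounts to a pair of mappings: literal → id, used while loading triples, and id → literal, used when query results are turned back into strings. A compact sketch of the assumed behavior (not RDF-3X's actual code):

```python
literal_to_id, id_to_literal = {}, []

def encode(term: str) -> int:
    """Assign each distinct literal/URI a small integer id exactly once."""
    if term not in literal_to_id:
        literal_to_id[term] = len(id_to_literal)
        id_to_literal.append(term)
    return literal_to_id[term]

triple = ("http://www.uni-kl.de", "isLocatedIn", "Kaiserslautern")
encoded = tuple(encode(t) for t in triple)
print(encoded)  # (0, 1, 2)

# Results are mapped back to strings only at output time.
decoded = tuple(id_to_literal[i] for i in encoded)
assert decoded == triple  # the round trip is lossless
```

Fixed-size integer ids are what make the triple store compressible and let the query processor join on cheap integer comparisons instead of long strings.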
Compressed Indexes:

When applying pattern matching on a triple store, we usually rely on the pattern
being supplied in the standard format; in many cases, however, patterns can take
different forms. To produce results in one scan for any supplied pattern, in any
order, the engine uses all permutations of the three components subject,
predicate and object. This ultimately stores each triple six times; however, the
engine overcomes the redundancy by applying compression. With the standard
ordering of a pattern being (subject (s), predicate (p), object (o)), the six
possible orderings of a triple are SPO, SOP, PSO, POS, OSP and OPS. The triples
of each permutation are sorted lexicographically and stored in the leaf pages of
a clustered B+-tree. Details on index compression and on the compression
algorithm used, with a comparison to other algorithms, can be found in Section
3.2 of [7, 8].
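The effect of keeping all six orderings is that any triple pattern with a fixed prefix (subject only, subject+predicate, predicate+object, and so on) can be answered by one range scan over the appropriate permutation. A toy sketch using sorted lists and `bisect` in place of B+-tree leaves:

```python
import bisect
from itertools import permutations

triples = [
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),
    ("s2", "p1", "o1"),
]

# One sorted list per permutation of (subject, predicate, object):
# spo, sop, pso, pos, osp, ops.
indexes = {}
for order in permutations((0, 1, 2)):
    key = "".join("spo"[i] for i in order)
    indexes[key] = sorted(tuple(t[i] for i in order) for t in triples)

def range_scan(index_name, prefix):
    """All entries whose leading components equal the given prefix."""
    idx = indexes[index_name]
    lo = bisect.bisect_left(idx, prefix)
    return [e for e in idx[lo:] if e[: len(prefix)] == prefix]

# Pattern (?, p1, ?) -> scan the index ordered predicate-first.
print(range_scan("pso", ("p1",)))
# [('p1', 's1', 'o1'), ('p1', 's2', 'o1')]
```

In the real engine the sorted entries additionally benefit from delta compression, since neighbouring entries in a sorted permutation share long common prefixes.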
Aggregated Indices:

Additional aggregated indexes are created in which only two of the three columns
of a triple are considered: two entries from the set of three possible entries
are extracted along with a count, i.e. the number of occurrences of this pair in
the whole set of triples. This is done for all six possible permutations, which
are then stored in the database; with compression applied once again, the effect
of adding them is almost negligible. The same is done for single entries, where
one column is considered together with a count and then stored, and compression
again makes the storage effect negligible. The reason for using aggregated
indexes is to simplify query translation: as many SPARQL query patterns show,
partial triples are often sufficient.
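An aggregated index can be mimicked with a counter over the projected columns; the count stands in for the actual triples, which is why it suffices for patterns that never look at the remaining component. A brief sketch on toy data:

```python
from collections import Counter

triples = [
    ("s1", "p1", "o1"),
    ("s1", "p1", "o2"),
    ("s2", "p1", "o1"),
]

# Aggregated index over (subject, predicate): pair -> number of triples.
sp_index = Counter((s, p) for s, p, o in triples)
# Aggregated index over a single column, e.g. the predicate alone.
p_index = Counter(p for s, p, o in triples)

print(sp_index[("s1", "p1")])  # 2 -- two triples share this (s, p) pair
print(p_index["p1"])           # 3
```

A query pattern that only asks "which objects exist for s1/p1, and how many" never has to touch the full triple index.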
Query Processing and Query Optimization
Translating SPARQL Queries:
In order to optimize the query it is necessary to first transform it into calculus
representation. A query graph representation is constructed which can be used as
relational tuple calculus since it is easier to optimize. Every supplied query is firstparsed and expanded into set of triples. A triple consists of either literal or variable.
The mapping of literals is done using the dictionary concept used earlier and ids are
retrieved.
For a conjunctive query, if the expansion yields a single triple pattern, its result is
retrieved and returned directly; if it yields more than one pattern, a join ordering
(discussed below) is chosen and the results of the individual patterns are joined before
being returned. Each triple pattern corresponds to a node in the query graph
constructed at the beginning. During matching, each node is evaluated against the
database and its results are retrieved in a single range scan; when the query tree
contains more than one variable, each variable binding requires one scan.
Duplicates are eliminated by an aggregation operator when the query contains a
DISTINCT clause. Finally, the ids are translated back into strings using the mapping
dictionary of identifiers.
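The translation steps above (dictionary encoding of literals, one range scan per pattern with a bound prefix, decoding ids back to strings) can be sketched as follows; this is a minimal illustration with invented names, assuming a single sorted SPO index over integer ids:

```python
import bisect

class Dictionary:
    """Bidirectional mapping between RDF terms and integer ids."""
    def __init__(self):
        self.to_id, self.to_str = {}, []
    def encode(self, term):
        if term not in self.to_id:
            self.to_id[term] = len(self.to_str)
            self.to_str.append(term)
        return self.to_id[term]
    def decode(self, ident):
        return self.to_str[ident]

def range_scan(index, prefix):
    """One range scan over a sorted index for a bound prefix of ids."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix[:-1] + (prefix[-1] + 1,))
    return index[lo:hi]

d = Dictionary()
spo_index = sorted(
    tuple(d.encode(term) for term in triple)
    for triple in [("alice", "knows", "bob"),
                   ("alice", "knows", "carol"),
                   ("bob", "age", "30")]
)
# The pattern "alice knows ?x" has the bound prefix (id(alice), id(knows)).
prefix = (d.encode("alice"), d.encode("knows"))
hits = range_scan(spo_index, prefix)
print([d.decode(o) for (_, _, o) in hits])   # ['bob', 'carol']
```

The real engine keeps all six permutations, so whichever components of a pattern are bound, some index offers them as a sorted prefix and one range scan suffices.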
Optimizing join ordering:
Join ordering is one of the most important issues in optimizing query plans. Many
methods exist for this problem, but almost none of them address the demanding join
properties created by the intrinsic characteristics of RDF and SPARQL. The three
properties, or requirements, observed in [7, 8] are:
- Sub-queries of a SPARQL query tend to be star-shaped, combining several
attribute-like properties of the same entity. They therefore require a strategy that
focuses on bushy trees rather than left-deep or right-deep trees.
- These star joins tend to occur at the nodes of long join paths, mostly at the start
or end of a path, and a SPARQL query can easily lead to 10 or more joins. A
shift to heuristic approximation or fast plan enumeration would therefore
sacrifice exact optimization.
- Since a very rich set of triple indexes has been produced and stored in the
database, these indexes should be used to full advantage; this requires extensive
use of joins while preserving orders in the creation of join plans.
Taken together, these properties rule out most of the notable methods used earlier
for query plan optimization. The first property disqualifies methods that only generate
linear (left-deep or right-deep) plans rather than bushy ones. The second restricts the
use of transformation-based top-down enumeration, allowing only a bottom-up
method. The third rules out sampling-based plan enumeration, which has little chance
of producing order-preserving query plans for more than 10 joins.
The proposed solution, which achieves exact optimization of query plans while
addressing all three properties, uses the bottom-up dynamic programming framework
of [14]. The technique is discussed further in Section 4.2 of [7, 8].
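The bottom-up dynamic programming idea can be illustrated with a much simplified sketch. This is a toy with an invented cost model and made-up cardinalities and selectivities, not the order-preserving algorithm of [14]: every subset of relations receives the cheapest plan found by combining the best plans of its sub-splits, which naturally produces bushy trees.

```python
from itertools import combinations

def dp_join_order(cards, sel):
    """cards: {relation: row count}; sel: {sorted (a, b) pair: selectivity}."""
    best = {}  # frozenset of relations -> (est. cardinality, total cost, plan)
    for r in cards:
        best[frozenset([r])] = (cards[r], 0.0, r)
    rels = sorted(cards)
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            for k in range(1, size // 2 + 1):       # every split into two halves
                for left in combinations(sorted(s), k):
                    l = frozenset(left)
                    r = s - l
                    lcard, lcost, lplan = best[l]
                    rcard, rcost, rplan = best[r]
                    # crude estimate: cardinality product times pair selectivities
                    f = 1.0
                    for a in l:
                        for b in r:
                            f *= sel.get(tuple(sorted((a, b))), 1.0)
                    card = lcard * rcard * f
                    cost = lcost + rcost + card     # cost = sum of join outputs
                    if s not in best or cost < best[s][1]:
                        best[s] = (card, cost, (lplan, rplan))
    return best[frozenset(rels)][2]

cards = {"A": 1000, "B": 10, "C": 100}
sel = {("A", "B"): 0.01, ("B", "C"): 0.05}
print(dp_join_order(cards, sel))   # ('A', ('B', 'C')): join B with C first
```

Because every subset is solved exactly before any superset is considered, the final plan is optimal under the given cost model; the real algorithm additionally prunes cross products and tracks interesting orders.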
Handling Disjunctive Queries:
SPARQL supports both conjunctive and disjunctive query types. The RDF-3X
engine does not focus heavily on disjunctive queries, but it supports their optimization
to some extent. The UNION expression of SPARQL returns the union of the bindings
generated by two or more groups of patterns. The OPTIONAL expression returns the
bindings of its pattern group if a result exists, or NULL bindings otherwise. In either
case, UNION and OPTIONAL expressions are first treated as nested sub-queries for
the sake of optimization: the nested sub-queries are optimized first, and the optimized
sub-queries are then treated as base relations when optimizing the outer query.
The RDF-3X engine also preserves cardinalities. Since an optimized SPARQL
query can produce many records, and standard SPARQL semantics demand that the
correct number of bindings be returned, duplicates arising during execution must be
accounted for. This is handled while scanning the indexes: non-aggregated indexes
yield a multiplicity of 1, while aggregated indexes yield the stored count as the
multiplicity.
Due to its complex algebraic operators, the RDF-3X engine has some rather
cumbersome implementation issues; it proves its worth, however, with two concrete
benefits: it is a RISC-style implementation, and its query execution time shows a
drastic performance difference compared to other systems.
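The behavior of UNION and OPTIONAL on sets of variable bindings, as described above, can be illustrated with a small sketch (a toy rendering of the SPARQL semantics, not RDF-3X code; all names are invented):

```python
def union(left, right):
    """SPARQL UNION: concatenate the bindings produced by the two groups."""
    return left + right

def optional(left, right, right_vars):
    """SPARQL OPTIONAL: keep every left binding; unmatched vars become None."""
    out = []
    for lb in left:
        # A right binding is compatible if all shared variables agree.
        matches = [rb for rb in right
                   if all(lb[v] == rb[v] for v in lb.keys() & rb.keys())]
        if matches:
            out.extend({**lb, **rb} for rb in matches)
        else:
            out.append({**lb, **{v: None for v in right_vars}})
    return out

people = [{"x": "alice"}, {"x": "bob"}]
emails = [{"x": "alice", "mail": "a@example.org"}]
print(optional(people, emails, {"mail"}))
# alice keeps her mail binding; bob is retained with mail=None
```

Treating each such group as a nested sub-query, optimizing it in isolation, and then using it as a base relation is exactly the strategy the engine applies to the outer query.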
4 Conclusion

In this paper we discussed the upcoming Semantic Web, which has proven to be a
necessity for the current Web architecture, and surveyed some of the most important
standards that constitute its power. We introduced RDF and SPARQL at a very basic
level and covered their basic syntax and structure.

After that we explained a few of the available techniques for storing RDF data and
examined them with respect to query performance. We first looked at the most
generic architecture that can be followed for Semantic Web data management, and
went into the details of the constituent parts of the Sesame architecture. We then
noted the problems that may arise when storing data in a single table and explained a
technique more advanced than the currently popular property tables: in the vertically
partitioned approach we saw the possibility of increased performance when RDF data
is queried, and an alternative to the property table for storing RDF data.

Finally, we discussed a quite different engine implementation that improves the
effectiveness of RDF storage, parses queries to optimize their basic plans, and enables
efficient querying. This engine uses a RISC-style architecture for storing RDF data
and a very complex set of algebraic operators to optimize queries.
5 References

[1] T. Berners-Lee, J. Hendler, O. Lassila. The Semantic Web. Scientific American, May
2001, Pages: 34-43.
[2] Graham Klyne, Jeremy J. Carroll, Brian McBride. Resource Description Framework
(RDF): Concepts and Abstract Syntax, W3C Recommendation. 2004.
[3] Eric Prud'hommeaux, Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. 2008.
[4] Jeen Broekstra, Arjohn Kampman, Frank van Harmelen. Sesame: A Generic
Architecture for Storing and Querying RDF and RDF Schema. First International Semantic
Web Conference, Sardinia, Italy, June 9-12, 2002, Pages: 54-68.
[5] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. Scalable
Semantic Web Data Management using Vertical Partitioning. VLDB '07 Proceedings of the
33rd international conference on Very large data bases, 2007, Pages: 411-422.
[6] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. SW-Store: a
vertically partitioned DBMS for Semantic Web data management. The VLDB Journal - The
International Journal on Very Large Data Bases, Volume 18 Issue 2, April 2009, Pages:
385-406.
[7] Thomas Neumann, Gerhard Weikum. RDF-3X: a RISC-style Engine for RDF.
Proceedings of the VLDB Endowment Volume 1 Issue 1, August 2008, Pages: 647-659.
[8] Thomas Neumann, Gerhard Weikum. The RDF-3X Engine for scalable management
of RDF Data. The VLDB Journal - The International Journal on Very Large Data Bases
Volume 19 Issue 1, February 2010, Pages: 91-113.
[9] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In Proceedings
of SIGMOD, Pages: 268-279, 1985.
[10] Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis.
RQL: a declarative query language for RDF. WWW '02 Proceedings of the 11th
international conference on World Wide Web, 2002, Pages: 592-603.
[11] K. Wilkinson. Jena property table implementation. In SSWS, 2006.
[12] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds. Efficient RDF Storage and Retrieval
in Jena2. In SWDB, Pages: 131-150, 2003.
[13] E. I. Chong et al. An efficient SQL-based RDF querying scheme. In VLDB, 2005.
[14] G. Moerkotte, Thomas Neumann. Analysis of two existing and one new dynamic
programming algorithm for the generation of optimal bushy join trees without cross
products. In VLDB, 2006.
[15] David Beckett, Tim Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team
Submission, 2011.