8/3/2019 Rdf, Sw, Sparql Final
RDF TRIPLE STORES,
SPARQL AND THE SEMANTIC
WEB.
Muntazir Mehdi
Department of Computer Science
Technical University Kaiserslautern
67653, Kaiserslautern, Germany
Abstract. Current research in the area of the World Wide Web has mainly
focused on the advent of a technology which enables machines to understand
data. This results in a whole new type of Web which contains meaningful
metadata in addition to the linked documents and their relationships, enabling
roaming agents to extract useful information with an automated process. This
new type of web is named the Semantic Web. For the sake of unified standards, all
developments in the area of the Semantic Web are handled by the World Wide Web
Consortium (W3C). This paper briefly discusses some of the already existing
standards: first we provide a brief introduction to the Resource Description
Framework (RDF), to SPARQL as query support for RDF, and to their syntax; then we
look into data management techniques for RDF triples; and finally we conclude
the paper by summarizing its individual parts.
Keywords: Semantic Web, Resource Description Framework (RDF), SPARQL,
RDF triple stores, RDF Data Management.
1 Introduction

Today the Web focuses only on the syntactic representation of information.
This information is nothing more than a network of documents linked together in
the form of web pages. It is understandable to humans, and that in itself is its
biggest drawback. Keeping in mind the drastic advancement in the field of the
internet, one can clearly see that the internet, which was once created as a
communication infrastructure to facilitate communication between parties, has
evolved into an information infrastructure from which we expect to extract
information. This itself raises some very important questions that must be
addressed, e.g.: What is the proper
mailto:[email protected]:[email protected]:[email protected] -
8/3/2019 Rdf, Sw, Sparql Final
2/18
2 Muntazir Mehdi
information? Where is the proper information? And when is the proper information
needed? The word semantics literally translates to "meaning of a linguistic term".
The Semantic Web, basically, is a web of content where web pages are linked via
semantic relations among them, thus helping machines, in addition to humans, to
process the information; this is the most important improvement, as noted in many
writings.
The Semantic Web will bring structure to the meaningful content of web pages,
creating an environment where software agents roaming from page to page can
readily carry out sophisticated tasks for users. [1]
Now that we have some idea about the traditional web and the Semantic Web, we
can easily answer the question: Why do we need semantics? But this is not enough,
because the basis of the Semantic Web is still the old traditional web; therefore
the issue of representing information still exists. Knowledge/information
representation is the most important part of the Semantic Web. Unlike the
traditional web, where information was represented with the help of HTML and
other scripting languages that captured only the syntactic features of
information, the Semantic Web demands a language capable of incorporating
semantic features so that information can be inferred from it. The Resource
Description Framework (RDF) is one of the most popular Semantic Web languages
and derives its roots from XML. XML itself is powerful enough to be used for
information representation, but a single domain of information or knowledge can
be encoded in XML in multiple styles, and such multiplicity leads to considerable
complexity when a wide range of unknown communication participants is involved.
The Resource Description
Framework (RDF) is a framework and standard data interchange model specified by
the World Wide Web Consortium (W3C) for modeling and representing information [2].
The basics of the Resource Description Framework focus on the statement "from
machine readable to machine understandable" and comprise two important parts,
the RDF Model and the RDF Syntax, which are further discussed in Section 2.1.
Figure 1 shows the famous Semantic Web layer cake, in which the Resource
Description Framework (RDF) can be seen as a significant block.
When thinking about the Semantic Web, one also has to consider data storage and
management. The data management side of the Semantic Web has not been a popular
topic among researchers, but now that the area has matured over time, many ideas
have been proposed for storing and managing the Semantic Web data model, i.e. the
Resource Description Framework (RDF) [2]. The diverse data models of the Semantic
Web demand an entirely new way of storing data. In RDF, information is captured
in the form of statements, and those statements are represented as (subject,
predicate, object) or (subject, property, value) triples. For example, the simple
statement "Technical University Kaiserslautern is located in Kaiserslautern,
Germany" can be represented using the directed graph in Figure 2, or as the
triple (subject: Technical University Kaiserslautern, predicate: isLocatedIn,
object: Kaiserslautern). More triples about the same resource/subject, i.e.
Technical University Kaiserslautern in our example, can be created, resulting in
a complete set of information. The information is first broken into statements
and then translated into triples, and these triples can then be stored in
different ways. In this paper,
in Section 3 we will discuss some of the known data management techniques
including their architecture, effects of querying them and their performance.
Fig. 1. Semantic Web layer cake
Fig. 2. Directed Graph for RDF statement.
Another significant block in the Semantic Web layer cake (Figure 1) is the
Rules/Query block, which has its own importance. Once the information
representation and data management steps are completed, it is still a cumbersome
task to fetch exactly the information we need. There are multiple ways to query
RDF; one well-known query language for the Resource Description Framework (RDF)
is SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language.
SPARQL is not the only available query support for RDF; many other flavors also
exist, e.g. RQL, SeRQL, TRIPLE, RDQL, N3 and Versa, and many data management
engines and stores have their own query support. SPARQL is considered a key
Semantic Web technology and is a W3C Recommendation because of its capability to
query a variety of data sources, whether the data is stored natively as RDF or
viewed as RDF with the help of additional middleware [3]. The detailed syntax
and structure of both RDF and SPARQL are explained in Sections 2.1
and 2.2 respectively. SPARQL is also briefly discussed in Section 3 where we talk
about the management of RDF data.
2 RDF & SPARQL: Concepts, Syntax and Structure

2.1 RDF
We have had a brief introduction to the Resource Description Framework (RDF)
above. In the following sections we will have a detailed look at RDF. Since RDF
is composed of two important parts, the RDF Model and the RDF Syntax, let us
explain both in a little more detail.
RDF Basic Concepts
For the sake of understanding, let us once again consider a simple example where
we try to state some information about something:

TU Kaiserslautern is located in Kaiserslautern

For human understanding, this statement about TU Kaiserslautern is simply
expressed in plain English. Looking at the statement, one can say that it can be
broken down into different parts, and to understand it each part of the
statement should be identified. In our example we see that the statement is
being made about TU Kaiserslautern, which is a university; it is located in some
place; and the place is Kaiserslautern. For the sake of identification, let us
reformat the statement and write it in other words, so that TU Kaiserslautern
can be easily and uniquely identified as a standalone entity, since there may be
a lot of universities located in Kaiserslautern.
http://www.uni-kl.de is located in Kaiserslautern
Now let us once again break down the information and see which blocks constitute
the statement:

- The statement is made about a thing, i.e. http://www.uni-kl.de
- The statement has a property concerned with the thing it describes, i.e. is located in.
- The property of the thing has a value, i.e. Kaiserslautern.
Since the statement has been broken down, each part can be individually
identified. In our case the thing/resource/subject is http://www.uni-kl.de, the
property/predicate attached to it is isLocatedIn, and the value/object for the
property is Kaiserslautern. More information about this resource can again be
stated using simple English sentences:

University of Kaiserslautern has a department of Computer Science
http://www.uni-kl.de was founded in July, 1970

Note that all the statements made above carry information about a single
subject, i.e. TU Kaiserslautern, but the problem is that the subject is
mentioned in three different ways. The main idea of RDF is to describe
resources, where resources have properties
and those properties have values. RDF uses a specific terminology for the parts
of a statement [2]: the part where the statement describes a resource is called
the subject, the part which states a property or characteristic of the subject
is called the property/predicate, and the part which gives the value of that
predicate or property is called the value/object. For example, in our statements:
Subject = TU Kaiserslautern / http://www.uni-kl.de / University of Kaiserslautern.
Predicate/Property = locatedIn / hasDepartment / foundedOn.
Object/Value = Kaiserslautern / Computer Science / July, 1970.
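The statements above can be written down directly as (subject, predicate, object) triples. As a quick illustration, here is a minimal Python sketch (the predicate names follow the examples in this section and are not prescribed by RDF itself):

```python
# Each RDF statement becomes one (subject, predicate, object) triple.
# The subject is the same URI in all three statements, which is exactly
# what lets the information merge into one description of the resource.
triples = [
    ("http://www.uni-kl.de", "isLocatedIn",   "Kaiserslautern"),
    ("http://www.uni-kl.de", "hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "foundedOn",     "July, 1970"),
]

# All statements about a single resource can be collected by filtering
# on the subject position.
about_uni = [(p, o) for (s, p, o) in triples if s == "http://www.uni-kl.de"]
print(about_uni)
```

Filtering on the subject position like this is the conceptual core of querying a triple store, as Section 2.2 will show.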
As we have so far been talking about human understanding, in order to make this
information processable by machines RDF requires the following two important
things [2]:

1. A language that is already processable by machines and can represent and
exchange these statements.
2. A system of identifying each part of a statement without any ambiguity,
relating it to resources available on the web, that is also machine processable.
The World Wide Web already has two solid identification mechanisms that are
machine processable: the Uniform Resource Locator (URL), which specifies the
location of a resource, and the Uniform Resource Identifier (URI), which is a
superset of URLs and can be created independently by any organization or person.
When we look at our example, we luckily have a resource with a URL identifier,
i.e. http://www.uni-kl.de, but what about resources which have no web location
or URL, e.g. a credit card, a human being, a telephone bill? URIs have the
capability to identify resources which are 1) on the web, 2) not on the web and
3) abstract concepts.
RDF information can easily be written by anyone, independently, using XML [2].
RDF does not define a new representation language from scratch; since XML is
already a machine-processable and exchangeable format, RDF uses a variation of
XML, namely RDF/XML, which follows a simple syntax similar to XML. There is
another, relatively new serialization for RDF data named Turtle [15], which has
become popular especially for its closeness to the SPARQL query syntax, but in
this paper we will only discuss RDF/XML.
RDF Model
RDF data can be represented in the form of triples which follow a certain pattern or
can be represented in the form of a directed graph.
Now that we know that a statement can be broken down into parts and these parts
can be identified using URIs, we can use RDF triples to represent the information. An
example of such triples can be seen as follows (written here with full URIs; the
predicate URIs are illustrative):

<http://www.uni-kl.de> <http://www.abc.com/customType#isLocatedIn> "Kaiserslautern" .
<http://www.uni-kl.de> <http://www.abc.com/customType#hasDepartment> "Computer Science" .
<http://www.uni-kl.de> <http://www.abc.com/customType#foundedOn> "July, 1970" .
The graph representation of such information can be seen in the following figure:
The key representation notation is that resources are represented using oval
shapes, predicates are represented as directed graph edges, and literal values
are represented using rectangles. A single arc represents a single triple; a
triple consists of subject, predicate and object (where the object can itself be
a resource). The graph evolves because objects which are themselves resources
can have their own properties and values.
Another notable point in representing information is that URIs are mostly long
strings; therefore, to make representations compact and easily understandable,
RDF provides the use of prefixes. This substitution is made using XML namespace
references added at the beginning of the document: a fully qualified URI is
substituted by an XML prefix. A simple example:

Prefix: ct, namespace URI: http://www.abc.com/customType

Thus the predicates become:

ct:isLocatedIn, ct:hasDepartment, ct:foundedOn.
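The substitution works like ordinary namespace expansion: a prefixed name is rewritten to a full URI by concatenating the namespace URI and the local part. A minimal sketch in Python (the ct namespace is the illustrative one from the text; the choice of "#" as separator is an assumption, since vocabulary authors may also end the namespace in "/"):

```python
# Map each declared prefix to its namespace URI, as in an RDF/XML header.
prefixes = {"ct": "http://www.abc.com/customType"}

def expand(qname: str) -> str:
    """Expand a prefixed name like 'ct:isLocatedIn' to a full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + "#" + local

print(expand("ct:isLocatedIn"))
# http://www.abc.com/customType#isLocatedIn
```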
RDF Syntax
As discussed earlier, RDF uses an XML structure for the representation of
information; however, the flavor of XML that RDF uses is its own specification,
named RDF/XML [2].
To understand RDF, let us look at the code example given below, where the
following piece of information is represented using RDF/XML syntax:

Technical University Kaiserslautern is located in Kaiserslautern. The university
has a department of Computer Science. It was founded in July, 1970.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ct="http://www.abc.com/customType#">
  <rdf:Description rdf:about="http://www.uni-kl.de">
    <ct:isLocatedIn>Kaiserslautern</ct:isLocatedIn>
    <ct:hasDepartment>Computer Science</ct:hasDepartment>
    <ct:foundedOn>July, 1970</ct:foundedOn>
  </rdf:Description>
</rdf:RDF>

The very first line, i.e. <?xml version="1.0"?>, indicates that the content
following this line is XML. The piece of code that says <rdf:RDF ...> opens the
RDF content and declares the namespaces used in the rest of the document.
Properties with literal values are written as nested elements such as
<ct:isLocatedIn>Kaiserslautern</ct:isLocatedIn>, and those properties which have
another RDF resource as their value are represented in the form
<ct:isLocatedIn rdf:resource="..." />. Finally, </rdf:RDF> marks the end of the
RDF content.

A single RDF document may contain information about more than one resource,
each within its respective rdf:Description tags.
Above we presented the very basic syntax and structure using an example. A
detailed syntax and specification can be further seen in [2].
2.2 SPARQL

A lot of work has been done to develop a query language that fulfills the
requirements of all Semantic Web standards. The race has always been towards
creating a query language that is very similar to SQL yet has the potential to
deal with semantic data. Many query languages have been proposed in the
literature (all having their own pros and cons); SPARQL [3] has proved to be a
query language that is a center point for many researchers.

SPARQL does not merely emulate the SQL syntax; it also supports full pattern
matching, optional pattern matching, conjunction and disjunction. The one
shortcoming of SPARQL that can easily be observed is its inability to alter
stored RDF data.

As we already know, the basic idea of RDF is the representation of information
in the form of RDF triples consisting of subject, predicate and object; SPARQL
is no exception, as it is built on the same triple pattern.
SPARQL Syntax and Structure
In this Section we will have a look into the basic syntax and structure of SPARQL
query. For understanding, let us look at the query example given below:

PREFIX ct: <http://www.abc.com/customType#>
SELECT ?name
WHERE
{
  <http://www.uni-kl.de> ct:hasDepartment ?name .
}

Let us first see what happens when this query is executed on the RDF data we
have been using till now. The output of the query would be:

Computer Science
If we break down the query mentioned in the code segment above, we will be able
to explain each and every part in detail. The very first line in the code segment begins
with the keyword PREFIX. When we were dealing with RDF triples, we had to
identify each part of the statement using an identifier, but identifiers tend to be large
strings. Therefore, we used XML namespaces with prefixes through which we were
able to reduce long strings. The keyword mentioned here is equivalent to that. The
first and foremost part is to declare any prefixes. A single query can have multiple
prefixes used wherever necessary.
The second line of the code segment has the SELECT keyword. As already
discussed, SPARQL draws its roots from standard SQL, and SELECT marks the
beginning of the query. It has the same concept as in SQL: it names the
variables we want the query to return, and it binds those variables to the
output we expect to receive.

The FROM keyword, not needed in our example since we are using local RDF data,
once again works in a fashion similar to SQL: it identifies the dataset on which
we want our query to be executed, which can be a local file as well as a remote
one.
Finally, we have the WHERE keyword. We know both representations of RDF data,
i.e. triples and graphs; a match is made between the data and the triple (or
graph) pattern we specify in the braces after WHERE. A WHERE clause can have
more than one pattern specified in it, each terminated by a dot. The WHERE
keyword itself is optional, as in SQL, and can be omitted.
In the above example and explanation we have seen the very basic syntax and
structure of SPARQL; for a detailed specification and more complex queries with
additional query possibilities, see [3].
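Conceptually, evaluating such a basic query is pattern matching over the triples: a position holding a variable (a name starting with "?") matches anything and produces a binding, while a position holding a concrete term must match exactly. A toy illustration in Python (not a real SPARQL engine, just the matching idea, using the example data from this section):

```python
triples = [
    ("http://www.uni-kl.de", "ct:isLocatedIn",   "Kaiserslautern"),
    ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "ct:foundedOn",     "July, 1970"),
]

def match(pattern, store):
    """Return one binding dict per triple in the store matching the pattern."""
    results = []
    for triple in store:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val     # variable position: bind the value
            elif pat != val:
                break                  # concrete term: must match exactly
        else:
            results.append(binding)
    return results

# Equivalent of: SELECT ?name WHERE { <http://www.uni-kl.de> ct:hasDepartment ?name . }
print(match(("http://www.uni-kl.de", "ct:hasDepartment", "?name"), triples))
# [{'?name': 'Computer Science'}]
```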
3 Data Management

In this section we will discuss some data management techniques for the Resource
Description Framework (RDF): Sesame [4], a state-of-the-art relational DBMS data
storage solution for RDF; the Vertically Partitioned Approach [5, 6], a
performance-enhancing data model decomposition [9] technique for storing RDF
data; and RDF-3X [7, 8], an engine implementation which follows a RISC-style
architecture for achieving high performance with SPARQL queries on RDF data.
3.1 Sesame

Sesame is a standard framework for processing RDF data. It is an open-source
Java framework for storing, querying and reasoning about RDF and RDF Schema. It
can be used both as database storage and as a Java library for developing
applications that work with RDF and RDF Schema. The implementation of Sesame
follows a generic architecture, the Sesame Architecture [4], which is discussed
further below. The implementation has been designed carefully, with the
flexibility to support a variety of storage systems (relational databases,
in-memory, file systems), and
offers a wide range of tools to developers to utilize the power of RDF and
Semantic Web standards. Sesame also includes support for SPARQL over both local
and remote stores, accessed transparently through the same API.

A packaged product and source code for Sesame can be downloaded from
http://www.openrdf.org/download.jsp.
Sesame's Architecture Overview

The overall architecture of Sesame can be seen in Figure 3; the individual
components are further explained here.
In the Sesame architecture, RDF data is stored in a scalable repository; RDF can
be stored in various ways depending on the selection of repository, and a DBMS
suits this role very well. A wide range of DBMS systems is available, each with
its own features, strengths and areas of usefulness. Sesame is therefore
implemented in a DBMS-independent fashion: all code specific to a particular
DBMS is concentrated in a single architectural layer, the Storage And Inference
Layer (SAIL). This layer serves the main functional modules of Sesame: SAIL is
an API responsible for translating the RDF-specific requests made by the
functional modules into requests for the specific DBMS. The functional modules
are further discussed below.
Sesame is packaged in a manner that allows it to be deployed both as a web
application and as a web service. The packaged implementation is deployed in a
web container supporting Java servlets and can then be accessed via HTTP(S) or
SOAP. For scalability, a handler for each means of communication is added
separately; an additional protocol handler can be added to access Sesame via a
different means of communication. The request router is responsible for
receiving requests from the protocol handlers and routing them to the respective
functional modules, and vice versa.
Fig. 3. Sesame's Architecture
Sesame's Functional Modules

The Query Module:

The query module in Sesame's implementation uses RQL [10], but in a variant
which corresponds closely to the W3C recommendations, with support for making
domain and range restrictions both optional and multiple. This variant is also
called SeRQL (Sesame RDF Query Language). In this paper, however, we will not
assume any specific query language when explaining this module.

The path followed by this module while responding to a request is shown in
Figure 4. The module carries out two important functions on a query, parsing and
optimization: it initially parses the query and creates a query tree model, and
this tree is then forwarded to an optimizer, which creates an optimized version
of the query tree model. For example, a SPARQL query can be translated into an
SQL query, optimized with respect to the underlying DBMS, and then forwarded for
execution.
Fig. 4. The Query Module flow path
The Admin Module:

The two main functions of the admin module are to incrementally add RDF(S) data
into the repository and to clean up the repository. For populating the
repository with information extracted from RDF(S), a simple process is followed.
Generally the RDF(S) data is available online or locally in the form of
serialized XML (the extensions may vary; both .xml and .rdf(s) occur). Many
parsers are available to extract data from these serialized XML files, e.g. the
one in the Jena toolkit. The parser receives the XML file and, after parsing,
produces the data in the form of (subject, predicate, object) or (subject,
property, value) RDF triples. The admin module then communicates with SAIL and
inserts the data into the repository. Reporting errors and warnings is also the
responsibility of this module.
The RDF Export Module:

The simplest part of the Sesame architecture is the export module, which is only
responsible for exporting RDF(S) data. Schema information is useful for some
tools, RDF data for others, and in some scenarios both schema and data are
needed. Based on the request made, this module can export the schema, the data,
or both. After communicating with SAIL, the
module receives the triple data and produces a serialized, XML-formatted file.
This enables Sesame to be integrated with other RDF tools.
3.2 Vertically Partitioned Approach

The Vertically Partitioned Approach [5, 6] is an alternative to the property
table approach. In order to understand it, let us first have a basic
understanding of property tables.
Usually, RDF data is first parsed into (subject, predicate, object) or (subject,
property, value) triples and then fed into an RDBMS. Since many literals are
large strings on which pattern-based querying is inefficient, a mapping is
commonly created between long literals and an identifier table [13] to further
increase performance. A simple example of storing RDF data in one table can be
seen in Figure 5(a). The drawback of this simple and straightforward approach,
however, is the query processing time it takes to retrieve results from the
store. For this reason, the developers of the Jena Semantic Web toolkit proposed
in Jena2 [11, 12] the property table concept, which is considerably more
efficient for query processing. The proposal contains two types of property
table. The first type, known as the Clustered Property Table, groups clusters of
properties that are common to many subjects into one table; the remaining
triples are inserted into a conventional triple table. An example can be seen in
Figure 5(b). The second type, known as the Property-Class Table, uses the
property part of an RDF triple: it creates classes based on properties that are
common among subjects and groups those subjects into individual tables. Again,
the leftover triples are stored in a conventional triple table, as with the
Clustered Property Table. This technique of storing triples with respect to
their classes has been found useful in Jena2 and is also very effective for
storing reified statements. Reification in the Semantic Web is defined as a
statement about a statement; for example, one statement is "Earth revolves round
the Sun" and a statement that reifies it is "Scientists believe that Earth
revolves round the Sun". When storing reified statements, rdf:Statement is
considered the class and the properties are rdf:subject, rdf:predicate and
rdf:object. An example of a Property-Class Table is shown in Figure 5(c).
Now that we have seen the property table technique, let us look at the
alternative approach, which enhances query performance by using a fully
decomposed storage model [9]. The Vertically Partitioned Approach is very simple
and straightforward: a table is created for every unique property in the data.
All unique properties are extracted from the triples, and the triples are then
inserted into the respective tables. Each table consists of two columns: the
first column holds the subject and the second the property value. The most
interesting and performance-enhancing part of this approach is the sorting
applied to the subject column of the individual tables. This enables subjects to
be located quickly, so fast merge joins can be used to reconstruct the required
information about multiple subsets of subjects. An example of this approach is
shown in Figure 6.
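The approach can be sketched in a few lines: one two-column table per distinct property, each kept sorted on the subject column. A small Python illustration with toy data (subject and value strings are invented for the example):

```python
from collections import defaultdict

triples = [
    ("uni-kl", "locatedIn", "Kaiserslautern"),
    ("uni-kl", "hasDepartment", "Computer Science"),
    ("uni-kl", "hasDepartment", "Mathematics"),  # multi-valued property
    ("tu-munich", "locatedIn", "Munich"),         # no hasDepartment row needed
]

# One (subject, value) table per property, sorted on the subject column.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))
for rows in tables.values():
    rows.sort()

# Multi-valued attributes are just successive rows; a missing value for a
# subject simply has no row at all (no NULLs to store).
print(tables["hasDepartment"])
# [('uni-kl', 'Computer Science'), ('uni-kl', 'Mathematics')]
```

Because every table is sorted on its subject column, combining two property tables for the same subjects can be done with a linear merge join, which is the source of the performance gain described above.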
The Vertically Partitioned Approach has several advantages over the property
table technique. Some of them are listed below:

Support for multi-valued attributes: Subjects which have more than one value for
a particular property can easily be stored in the decomposed storage model; the
technique is simply to add the values in successive rows.

Support for heterogeneous records: This is the biggest advantage of the
Vertically Partitioned Approach over property tables. When dealing with
unstructured or poorly structured data, there is always the possibility of
property values missing for some subjects. The idea here is simply to omit them
while populating the table; in other words, NULL values need not be stored
anymore.
Fig. 5. RDF Triple data and property table examples
Certainly this approach has its own disadvantages in some scenarios; however, it
has been observed that, compared to the property table technique, it has the
upper hand. A detailed performance comparison of both approaches can be seen in
Section 6 of [5, 6].
Fig. 6. Vertically Partitioned Approach example
3.3 RDF-3X

RDF-3X, as the name suggests, is an engine implementation covering three salient
features: 1) a generic storage solution for RDF data, designed so that no
further tuning is required, 2) a query processor and 3) a query optimizer.
Storage and Indexing
Triples Store and Dictionary:
As discussed earlier, the current state-of-the-art schema for storing RDF data
is the property table, but this engine once again uses a simple approach in
which all RDF data is extracted in the form of (subject, predicate, object) or
(subject, property, value) triples. Once extracted, the triples are stored in a
repository; the repository is a custom storage implementation rather than an
RDBMS, which supports the RISC-style design principle. Section 3.2 mentioned the
costs of directly storing RDF triples in a single table; the engine
implementation counters the criticism that a single table incurs too many
self-joins by creating indexes which prove to be very efficient.
Once again, the notion of a dictionary, mapping large string literals to an
identifier table, is used as before. The cost of this is indexing the
dictionary, but it brings two benefits: 1) compression of the triple store and
2) simplification of the query processor. All triples are stored in a clustered
B+-tree, sorted lexicographically. This data structure helps convert SPARQL
patterns into range scans; another advantage is that when a specific pattern is
matched, the bindings for every unknown literal can be found in a single scan,
in logarithmic amortized time.
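The dictionary idea amounts to a pair of mappings: literal → id, used while loading triples, and id → literal, used when query results are turned back into strings. A compact sketch of the assumed behavior (not RDF-3X's actual code):

```python
literal_to_id, id_to_literal = {}, []

def encode(term: str) -> int:
    """Assign each distinct literal/URI a small integer id exactly once."""
    if term not in literal_to_id:
        literal_to_id[term] = len(id_to_literal)
        id_to_literal.append(term)
    return literal_to_id[term]

triple = ("http://www.uni-kl.de", "isLocatedIn", "Kaiserslautern")
encoded = tuple(encode(t) for t in triple)
print(encoded)  # (0, 1, 2)

# Results are mapped back to strings only at output time.
decoded = tuple(id_to_literal[i] for i in encoded)
assert decoded == triple  # the round trip is lossless
```

Fixed-size integer ids are what make the triple store compressible and let the query processor join on cheap integer comparisons instead of long strings.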
Compressed Indexes:

When applying pattern matching on a triple store, we usually rely on the pattern
being supplied in the standard format; in many cases, however, patterns can take
different forms. To produce results in one scan for any supplied pattern, in any
order, the engine uses all permutations of the three components subject,
predicate and object. This ultimately stores each triple six times; however, the
engine overcomes the redundancy by applying compression. With the standard
ordering of a pattern being (subject (s), predicate (p), object (o)), the six
possible orderings of a triple are SPO, SOP, PSO, POS, OSP and OPS. The triples
of each permutation are sorted lexicographically and stored in the leaf pages of
a clustered B+-tree. Details on index compression and on the compression
algorithm used, with a comparison to other algorithms, can be found in Section
3.2 of [7, 8].
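The effect of keeping all six orderings is that any triple pattern with a fixed prefix (subject only, subject+predicate, predicate+object, and so on) can be answered by one range scan over the appropriate permutation. A toy sketch using sorted lists and `bisect` in place of B+-tree leaves:

```python
import bisect
from itertools import permutations

triples = [
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),
    ("s2", "p1", "o1"),
]

# One sorted list per permutation of (subject, predicate, object):
# spo, sop, pso, pos, osp, ops.
indexes = {}
for order in permutations((0, 1, 2)):
    key = "".join("spo"[i] for i in order)
    indexes[key] = sorted(tuple(t[i] for i in order) for t in triples)

def range_scan(index_name, prefix):
    """All entries whose leading components equal the given prefix."""
    idx = indexes[index_name]
    lo = bisect.bisect_left(idx, prefix)
    return [e for e in idx[lo:] if e[: len(prefix)] == prefix]

# Pattern (?, p1, ?) -> scan the index ordered predicate-first.
print(range_scan("pso", ("p1",)))
# [('p1', 's1', 'o1'), ('p1', 's2', 'o1')]
```

In the real engine the sorted entries additionally benefit from delta compression, since neighbouring entries in a sorted permutation share long common prefixes.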
Aggregated Indices:

Additional aggregated indexes are created in which only two of the three columns
of a triple are considered: two entries from the set of three possible entries
are extracted along with a count, i.e. the number of occurrences of this pair in
the whole set of triples. This is done for all six possible permutations, which
are then stored in the database; with compression applied once again, the effect
of adding them is almost negligible. The same is done for single entries, where
one column is considered together with a count and then stored, and compression
again makes the storage effect negligible. The reason for using aggregated
indexes is to simplify query translation: as many SPARQL query patterns show,
partial triples are often sufficient.
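An aggregated index can be mimicked with a counter over the projected columns; the count stands in for the actual triples, which is why it suffices for patterns that never look at the remaining component. A brief sketch on toy data:

```python
from collections import Counter

triples = [
    ("s1", "p1", "o1"),
    ("s1", "p1", "o2"),
    ("s2", "p1", "o1"),
]

# Aggregated index over (subject, predicate): pair -> number of triples.
sp_index = Counter((s, p) for s, p, o in triples)
# Aggregated index over a single column, e.g. the predicate alone.
p_index = Counter(p for s, p, o in triples)

print(sp_index[("s1", "p1")])  # 2 -- two triples share this (s, p) pair
print(p_index["p1"])           # 3
```

A query pattern that only asks "which objects exist for s1/p1, and how many" never has to touch the full triple index.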
Query Processing and Query Optimization
Translating SPARQL Queries:
In order to optimize the query it is necessary to first transform it into calculus
representation. A query graph representation is constructed which can be used as
relational tuple calculus since it is easier to optimize. Every supplied query is firstparsed and expanded into set of triples. A triple consists of either literal or variable.
The mapping of literals is done using the dictionary concept used earlier and ids are
retrieved.
For a conjunctive query, if the expansion yields a single triple pattern, its result is
retrieved and returned directly; if it yields more than one pattern, a join ordering
(discussed below) is chosen and the results of the individual patterns are joined before
being returned. Each triple pattern corresponds to a node in the query graph
constructed at the beginning. During matching, each node is evaluated against the
database and its results are retrieved in a single range scan; when the query tree
contains more than one variable, each variable binding requires one scan.
Duplicates are eliminated by an aggregation operator when the query contains a
DISTINCT clause. Finally, the ids are translated back into strings using the mapping
dictionary of identifiers.
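The translation steps above (dictionary encoding of literals, one range scan per pattern with a bound prefix, decoding ids back to strings) can be sketched as follows; this is a minimal illustration with invented names, assuming a single sorted SPO index over integer ids:

```python
import bisect

class Dictionary:
    """Bidirectional mapping between RDF terms and integer ids."""
    def __init__(self):
        self.to_id, self.to_str = {}, []
    def encode(self, term):
        if term not in self.to_id:
            self.to_id[term] = len(self.to_str)
            self.to_str.append(term)
        return self.to_id[term]
    def decode(self, ident):
        return self.to_str[ident]

def range_scan(index, prefix):
    """One range scan over a sorted index for a bound prefix of ids."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix[:-1] + (prefix[-1] + 1,))
    return index[lo:hi]

d = Dictionary()
spo_index = sorted(
    tuple(d.encode(term) for term in triple)
    for triple in [("alice", "knows", "bob"),
                   ("alice", "knows", "carol"),
                   ("bob", "age", "30")]
)
# The pattern "alice knows ?x" has the bound prefix (id(alice), id(knows)).
prefix = (d.encode("alice"), d.encode("knows"))
hits = range_scan(spo_index, prefix)
print([d.decode(o) for (_, _, o) in hits])   # ['bob', 'carol']
```

The real engine keeps all six permutations, so whichever components of a pattern are bound, some index offers them as a sorted prefix and one range scan suffices.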
Optimizing join ordering:
Join ordering is one of the most important issues in optimizing query plans. Many
methods exist for this problem, but almost none of them address the demanding join
properties created by the intrinsic characteristics of RDF and SPARQL. The three
properties, or requirements, observed in [7, 8] are:
- Sub-queries of a SPARQL query tend to be star-shaped, combining several
attribute-like properties of the same entity. They therefore require a strategy that
focuses on bushy trees rather than left-deep or right-deep trees.
- These star joins tend to occur at the nodes of long join paths, mostly at the start
or end of a path, and a SPARQL query can easily lead to 10 or more joins. A
shift to heuristic approximation or fast plan enumeration would therefore
sacrifice exact optimization.
- Since a very rich set of triple indexes has been produced and stored in the
database, these indexes should be used to full advantage; this requires extensive
use of joins while preserving orders in the creation of join plans.
Taken together, these properties rule out most of the notable methods used earlier
for query plan optimization. The first property disqualifies methods that only generate
linear (left-deep or right-deep) plans rather than bushy ones. The second restricts the
use of transformation-based top-down enumeration, allowing only a bottom-up
method. The third rules out sampling-based plan enumeration, which has little chance
of producing order-preserving query plans for more than 10 joins.
The proposed solution, which achieves exact optimization of query plans while
addressing all three properties, uses the bottom-up dynamic programming framework
of [14]. The technique is discussed further in Section 4.2 of [7, 8].
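The bottom-up dynamic programming idea can be illustrated with a much simplified sketch. This is a toy with an invented cost model and made-up cardinalities and selectivities, not the order-preserving algorithm of [14]: every subset of relations receives the cheapest plan found by combining the best plans of its sub-splits, which naturally produces bushy trees.

```python
from itertools import combinations

def dp_join_order(cards, sel):
    """cards: {relation: row count}; sel: {sorted (a, b) pair: selectivity}."""
    best = {}  # frozenset of relations -> (est. cardinality, total cost, plan)
    for r in cards:
        best[frozenset([r])] = (cards[r], 0.0, r)
    rels = sorted(cards)
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            for k in range(1, size // 2 + 1):       # every split into two halves
                for left in combinations(sorted(s), k):
                    l = frozenset(left)
                    r = s - l
                    lcard, lcost, lplan = best[l]
                    rcard, rcost, rplan = best[r]
                    # crude estimate: cardinality product times pair selectivities
                    f = 1.0
                    for a in l:
                        for b in r:
                            f *= sel.get(tuple(sorted((a, b))), 1.0)
                    card = lcard * rcard * f
                    cost = lcost + rcost + card     # cost = sum of join outputs
                    if s not in best or cost < best[s][1]:
                        best[s] = (card, cost, (lplan, rplan))
    return best[frozenset(rels)][2]

cards = {"A": 1000, "B": 10, "C": 100}
sel = {("A", "B"): 0.01, ("B", "C"): 0.05}
print(dp_join_order(cards, sel))   # ('A', ('B', 'C')): join B with C first
```

Because every subset is solved exactly before any superset is considered, the final plan is optimal under the given cost model; the real algorithm additionally prunes cross products and tracks interesting orders.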
Handling Disjunctive Queries:
SPARQL supports both conjunctive and disjunctive query types. The RDF-3X
engine does not focus heavily on disjunctive queries, but it supports their optimization
to some extent. The UNION expression of SPARQL returns the union of the bindings
generated by two or more groups of patterns. The OPTIONAL expression returns the
bindings of its pattern group if a result exists, or NULL bindings otherwise. In either
case, UNION and OPTIONAL expressions are first treated as nested sub-queries for
the sake of optimization: the nested sub-queries are optimized first, and the optimized
sub-queries are then treated as base relations when optimizing the outer query.
The RDF-3X engine also preserves cardinalities. Since an optimized SPARQL
query can produce many records, and standard SPARQL semantics demand that the
correct number of bindings be returned, duplicates arising during execution must be
accounted for. This is handled while scanning the indexes: non-aggregated indexes
yield a multiplicity of 1, while aggregated indexes yield the stored count as the
multiplicity.
Due to its complex algebraic operators, the RDF-3X engine has some rather
cumbersome implementation issues; it proves its worth, however, with two concrete
benefits: it is a RISC-style implementation, and its query execution time shows a
drastic performance difference compared to other systems.
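The behavior of UNION and OPTIONAL on sets of variable bindings, as described above, can be illustrated with a small sketch (a toy rendering of the SPARQL semantics, not RDF-3X code; all names are invented):

```python
def union(left, right):
    """SPARQL UNION: concatenate the bindings produced by the two groups."""
    return left + right

def optional(left, right, right_vars):
    """SPARQL OPTIONAL: keep every left binding; unmatched vars become None."""
    out = []
    for lb in left:
        # A right binding is compatible if all shared variables agree.
        matches = [rb for rb in right
                   if all(lb[v] == rb[v] for v in lb.keys() & rb.keys())]
        if matches:
            out.extend({**lb, **rb} for rb in matches)
        else:
            out.append({**lb, **{v: None for v in right_vars}})
    return out

people = [{"x": "alice"}, {"x": "bob"}]
emails = [{"x": "alice", "mail": "a@example.org"}]
print(optional(people, emails, {"mail"}))
# alice keeps her mail binding; bob is retained with mail=None
```

Treating each such group as a nested sub-query, optimizing it in isolation, and then using it as a base relation is exactly the strategy the engine applies to the outer query.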
4 Conclusion

In this paper we discussed the upcoming Semantic Web, which has proven to be a
necessity for the current Web architecture, and surveyed some of the most important
standards that constitute its power. We introduced RDF and SPARQL at a very basic
level and covered their basic syntax and structure.

After that we explained a few of the available techniques for storing RDF data and
examined them with respect to query performance. We first looked at the most
generic architecture that can be followed for Semantic Web data management, and
went into the details of the constituent parts of the Sesame architecture. We then
noted the problems that may arise when storing data in a single table and explained a
technique more advanced than the currently popular property tables: in the vertically
partitioned approach we saw the possibility of increased performance when RDF data
is queried, and an alternative to the property table for storing RDF data.

Finally, we discussed a quite different engine implementation that improves the
effectiveness of RDF storage, parses queries to optimize their basic plans, and enables
efficient querying. This engine uses a RISC-style architecture for storing RDF data
and a very complex set of algebraic operators to optimize queries.
5 References

[1] T. Berners-Lee, J. Hendler, O. Lassila. The Semantic Web. Scientific American, May
2001, Pages: 34-43.
[2] Graham Klyne, Jeremy J. Carroll, Brian McBride. Resource Description Framework
(RDF): Concepts and Abstract Syntax, W3C Recommendation. 2004.
[3] Eric Prud'hommeaux, Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. 2008.
[4] Jeen Broekstra, Arjohn Kampman, Frank van Harmelen. Sesame: A Generic
Architecture for Storing and Querying RDF and RDF Schema. First International Semantic
Web Conference, Sardinia, Italy, June 9-12, 2002, Pages: 54-68.
[5] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. Scalable
Semantic Web Data Management using Vertical Partitioning. VLDB '07 Proceedings of the
33rd international conference on Very large data bases, 2007, Pages: 411-422.
[6] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. SW-Store: a
vertically partitioned DBMS for Semantic Web data management. The VLDB Journal - The
International Journal on Very Large Data Bases, Volume 18 Issue 2, April 2009, Pages:
385-406.
[7] Thomas Neumann, Gerhard Weikum. RDF-3X: a RISC-style Engine for RDF.
Proceedings of the VLDB Endowment Volume 1 Issue 1, August 2008, Pages: 647-659.
[8] Thomas Neumann, Gerhard Weikum. The RDF-3X Engine for scalable management
of RDF Data. The VLDB Journal - The International Journal on Very Large Data Bases
Volume 19 Issue 1, February 2010, Pages: 91-113.
[9] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In Proceedings
of SIGMOD, Pages: 268-279, 1985.
[10] Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis.
RQL: a declarative query language for RDF. WWW '02 Proceedings of the 11th
international conference on World Wide Web, 2002, Pages: 592-603.
[11] K. Wilkinson. Jena property table implementation. In SSWS, 2006.
[12] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds. Efficient RDF Storage and Retrieval
in Jena2. In SWDB, Pages: 131-150, 2003.
[13] E. I. Chong et al. An efficient SQL-based RDF querying scheme. In VLDB, 2005.
[14] G. Moerkotte, Thomas Neumann. Analysis of two existing and one new dynamic
programming algorithm for the generation of optimal bushy join trees without cross
products. In VLDB, 2006.
[15] David Beckett, Tim Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team
Submission, 2011.