Web Data Management in RDF Age

Web Data Management in RDF Age

M. Tamer Ozsu

University of WaterlooDavid R. Cheriton School of Computer Science

1Inria/2014-10-01

AcknowledgementsThis presentation draws upon collaborative research anddiscussions with the following colleagues (in alphabetical order)

Gunes Aluc, University of Waterloo

Khuzaima Daudjee, University of Waterloo

Olaf Hartig, University of Waterloo

Lei Chen, Hong Kong University of Science & Technology

Lei Zou, Peking University

3Inria/2014-10-01

Web Data Management

A long term research interest in the DB community

2000 2004

2011 20114Inria/2014-10-01

Interest Due to Properties of Web Data

Lack of a schema

Data is at best “semi-structured”Missing data, additional attributes, similar data but notidentical

Volatility

Changes frequentlyMay conform to one schema now, but not later

Scale

Does it make sense to talk about a schema for Web?How do you capture “everything”?

Querying difficulty

What is the user language?What are the primitives?Arent search engines or metasearch engines sufficient?

5Inria/2014-10-01

More Recent Approaches to Web Querying

Fusion TablesUsers contribute data in spreadsheet, CVS, KML formatPossible joins between multiple data setsExtensive visualization

8Inria/2014-10-01


Fusion TablesUsers contribute data in spreadsheet, CVS, KML formatPossible joins between multiple data setsExtensive visualization

XMLData exchange languagePrimarily tree based structure

<list title="MOVIES">

<film>

<title>The Shining</title>

<director>Stanley Kubrick</director>

<actor>Jack Nicholson</actor>

</film>

<film>

<title>Spartacus</title>

<director>Stanley Kubrick</director>

</film>

<film>

<title>The Passenger</title>

<actor>Jack Nicholson</actor>

</film>

...

</list>

root

film

title

“The Shining”

director

“Stanley Kubrick”

actor

“Jack Nicholson”

film

...

film

title

“The Passenger”

actor


8Inria/2014-10-01


Fusion Tables

Users contribute data in spreadsheet, CVS, KML formatPossible joins between multiple data setsExtensive visualization

XML

Data exchange languagePrimarily tree based structure

RDF (Resource Description Framework) & SPARQL

W3C recommendationSimple, self-descriptive modelBuilding block of semantic web & Linked Open Data (LOD)

8Inria/2014-10-01

RDF Data Volumes . . .

. . . are growing – and fast

Linked data cloud currently consists of 325 datasets with>25B triplesSize almost doubling every year

11Inria/2014-10-01




As of March 2009

LinkedCTReactome

Taxonomy

KEGG

PubMed

GeneID

Pfam

UniProt

OMIM

PDB

SymbolChEBI

Daily Med

Disea-some

CAS

HGNC

InterPro

Drug Bank

UniParc

UniRef

ProDom

PROSITE

Gene Ontology

HomoloGene

PubChem

MGI

UniSTS

GEOSpecies

Jamendo

BBCProgramm

es

Music-brainz

Magna-tune

BBCLater +TOTP

SurgeRadio

MySpaceWrapper

Audio-Scrobbler

LinkedMDB

BBCJohnPeel

BBCPlaycount

Data

Gov-Track

US Census Data

riese

Geo-names

lingvoj

World Fact-book

Euro-stat

IRIT Toulouse

SWConference

Corpus

RDF Book Mashup

Project Guten-berg

DBLPHannover

DBLPBerlin

LAAS- CNRS

Buda-pestBME

IEEE

IBM

Resex

Pisa

New-castle

RAE 2001

CiteSeer

ACM

DBLP RKB

Explorer

eprints

LIBRIS

SemanticWeb.org Eurécom

ECS South-ampton

RevyuSIOCSites

Doap-space

Flickrexporter

FOAFprofiles

flickrwrappr

CrunchBase

Sem-Web-

Central

Open-Guides

Wiki-company

QDOS

Pub Guide

Open Calais

RDF ohloh

W3CWordNet

OpenCyc

UMBEL

Yago

DBpediaFreebase

Virtuoso Sponger

March ’09:89 datasets

11Inria/2014-10-01

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.http://lod-cloud.net/




As of September 2010

MusicBrainz

(zitgist)

P20

YAGO

World Fact-book (FUB)

WordNet (W3C)

WordNet(VUA)

VIVO UFVIVO

Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UMBEL

UK Post-codes

legislation.gov.uk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdata.gov

.uk

totl.net

Tele-graphis

TCMGeneDIT

TaxonConcept

The Open Library (Talis)

t4gm

Surge Radio

STW

RAMEAU SH

statisticsdata.gov

.uk

St. Andrews Resource

Lists

ECS South-ampton EPrints

Semantic CrunchBase

semanticweb.org

SemanticXBRL

SWDog Food

rdfabout US SEC

Wiki

UN/LOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAAS

KISTIJISC

IRIT

IEEE

IBM

Eurécom

ERA

ePrints

dotAC

DEPLOY

DBLP (RKB

Explorer)

Course-ware

CORDIS

CiteSeer

Budapest

ACM

riese

Revyu

researchdata.gov

.uk

referencedata.gov

.uk

Recht-spraak.

nl

RDFohloh

Last.FM (rdfize)

RDF Book

Mashup

PSH

ProductDB

PBAC

Poké-pédia

Ord-nance Survey

Openly Local

The Open Library

OpenCyc

OpenCalais

OpenEI

New York

Times

NTU Resource

Lists

NDL subjects

MARC Codes List

Man-chesterReading

Lists

Lotico

The London Gazette

LOIUS

lobidResources

lobidOrgani-sations

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

Linked Open

Numbers

lingvoj

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Kno.e.sis)

Good-win

Family

Jamendo

iServe

NSZL Catalog

GovTrack

GESIS

GeoSpecies

GeoNames

GeoLinkedData(es)

GTAA

STITCHSIDER

Project Guten-berg (FUB)

MediCare

Euro-stat

(FUB)

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

Freebase

flickr wrappr

Fishes of Texas

FanHubz

Event-Media

EUTC Produc-

tions

Eurostat

EUNIS

ESD stan-dards

Popula-tion (En-AKTing)

NHS (EnAKTing)

Mortality (En-

AKTing)Energy

(En-AKTing)

CO2(En-

AKTing)

educationdata.gov

.uk

ECS South-ampton

Gem. Norm-datei

datadcs

MySpace(DBTune)

MusicBrainz

(DBTune)

Magna-tune

John Peel(DB

Tune)

classical(DB

Tune)

Audio-scrobbler (DBTune)

Last.fmArtists

(DBTune)

DBTropes

dbpedia lite

DBpedia

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Discogs(Data In-cubator)

Climbing

Linked Data for Intervals

Cornetto

Chronic-ling

America

Chem2Bio2RDF

biz.data.

gov.uk

UniSTS

UniRef

UniPath-way

UniParc

Taxo-nomy

UniProt

SGD

Reactome

PubMed

PubChem

PRO-SITE

ProDom

Pfam PDB

OMIM

OBO

MGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Cpd

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

GenBank

ChEBI

CAS

Affy-metrix

BibBaseBBC

Wildlife Finder

BBC Program

mesBBC

Music

rdfaboutUS Census

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

September ’10:203 datasets

11Inria/2014-10-01





As of September 2011

MusicBrainz

(zitgist)

P20

Turismo de

Zaragoza

yovisto

Yahoo! Geo

Planet

YAGO

World Fact-book

El ViajeroTourism

WordNet (W3C)

WordNet (VUA)

VIVO UF

VIVO Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UniRef

UniProt

UMBEL

UK Post-codes

legislationdata.gov.uk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdata.gov.

uk

Traffic Scotland

theses.fr

Thesau-rus W

totl.net

Tele-graphis

TCMGeneDIT

TaxonConcept

Open Library (Talis)

tags2con delicious

t4gminfo

Swedish Open

Cultural Heritage

Surge Radio

Sudoc

STW

RAMEAU SH

statisticsdata.gov.

uk

St. Andrews Resource

Lists

ECS South-ampton EPrints

SSW Thesaur

us

SmartLink

Slideshare2RDF

semanticweb.org

SemanticTweet

Semantic XBRL

SWDog Food

Source Code Ecosystem Linked Data

US SEC (rdfabout)

Sears

Scotland Geo-

graphy

ScotlandPupils &Exams

Scholaro-meter

WordNet (RKB

Explorer)

Wiki

UN/LOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAASKISTI

JISC

IRIT

IEEE

IBM

Eurécom

ERA

ePrints dotAC

DEPLOY

DBLP (RKB

Explorer)

Crime Reports

UK

Course-ware

CORDIS (RKB

Explorer)CiteSeer

Budapest

ACM

riese

Revyu

researchdata.gov.

ukRen. Energy Genera-

tors

referencedata.gov.

uk

Recht-spraak.

nl

RDFohloh

Last.FM (rdfize)

RDF Book

Mashup

Rådata nå!

PSH

Product Types

Ontology

ProductDB

PBAC

Poké-pédia

patentsdata.go

v.uk

OxPoints

Ord-nance Survey

Openly Local

Open Library

OpenCyc

Open Corpo-rates

OpenCalais

OpenEI

Open Election

Data Project

OpenData

Thesau-rus

Ontos News Portal

OGOLOD

JanusAMP

Ocean Drilling Codices

New York

Times

NVD

ntnusc

NTU Resource

Lists

Norwe-gian

MeSH

NDL subjects

ndlna

myExperi-ment

Italian Museums

medu-cator

MARC Codes List

Man-chester Reading

Lists

Lotico

Weather Stations

London Gazette

LOIUS

Linked Open Colors

lobidResources

lobidOrgani-sations

LEM

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

LinkedUser

FeedbackLOV

Linked Open

Numbers

LODE

Eurostat (OntologyCentral)

Linked EDGAR

(OntologyCentral)

Linked Crunch-

base

lingvoj

Lichfield Spen-ding

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Kno.e.sis)

Klapp-stuhl-club

Good-win

Family

National Radio-activity

JP

Jamendo (DBtune)

Italian public

schools

ISTAT Immi-gration

iServe

IdRef Sudoc

NSZL Catalog

Hellenic PD

Hellenic FBD

PiedmontAccomo-dations

GovTrack

GovWILD

GoogleArt

wrapper

gnoss

GESIS

GeoWordNet

GeoSpecies

GeoNames

GeoLinkedData

GEMET

GTAA

STITCH

SIDER

Project Guten-berg

MediCare

Euro-stat

(FUB)

EURES

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

CORDIS(FUB)

Freebase

flickr wrappr

Fishes of Texas

Finnish Munici-palities

ChEMBL

FanHubz

EventMedia

EUTC Produc-

tions

Eurostat

Europeana

EUNIS

EU Insti-

tutions

ESD stan-dards

EARTh

Enipedia

Popula-tion (En-AKTing)

NHS(En-

AKTing) Mortality(En-

AKTing)

Energy (En-

AKTing)

Crime(En-

AKTing)

CO2 Emission

(En-AKTing)

EEA

SISVU

education.data.g

ov.uk

ECS South-ampton

ECCO-TCP

GND

Didactalia

DDC Deutsche Bio-

graphie

datadcs

MusicBrainz

(DBTune)

Magna-tune

John Peel

(DBTune)

Classical (DB

Tune)

AudioScrobbler (DBTune)

Last.FM artists

(DBTune)

DBTropes

Portu-guese

DBpedia

dbpedia lite

Greek DBpedia

DBpedia

data-open-ac-uk

SMCJournals

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Metoffice Weather Forecasts

Discogs (Data

Incubator)

Climbing

data.gov.uk intervals

Data Gov.ie

databnf.fr

Cornetto

reegle

Chronic-ling

America

Chem2Bio2RDF

Calames

businessdata.gov.

uk

Bricklink

Brazilian Poli-

ticians

BNB

UniSTS

UniPathway

UniParc

Taxonomy

UniProt(Bio2RDF)

SGD

Reactome

PubMedPub

Chem

PRO-SITE

ProDom

Pfam

PDB

OMIMMGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Com-pound

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

Affy-metrix

bible ontology

BibBase

FTS

BBC Wildlife Finder

BBC Program

mes BBC Music

Alpine Ski

Austria

LOCAH

Amster-dam

Museum

AGROVOC

AEMET

US Census (rdfabout)

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

September ’11:295 datasets, 25B

triples

11Inria/2014-10-01





April ’14:1091 datasets, ???

triples

11Inria/2014-10-01

Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of LinkedData Best Practices in Different Topical Domains. In Proc. ISWC, 2014.

Closer Look

12Inria/2014-10-01

Globally Distributed Network of Data

13Inria/2014-10-01

Three Approaches

Data warehousing

Consolidate data in a repository and query it

SPARQL federation

Leverage query services provided by data publishers

Live Linked Data querying

Navigate through LOD by looking up URIs at query executiontime

14Inria/2014-10-01

Outline

1 LOD and RDF Introduction

2 Data Warehousing ApproachRelational ApproachesGraph-Based Approaches

3 SPARQL Federation ApproachDistributed RDF ProcessingSPARQL Endpoint Federation

4 Live Querying ApproachTraversal-based approachesIndex-based approachesHybrid approaches

5 Conclusions

15Inria/2014-10-01

Outline





5 Conclusions

16Inria/2014-10-01

Traditional Hypertext-based Web Access

IMDb WorldBook

Data exposedto the Webvia HTML

17Inria/2014-10-01

Linked Data Publishing Principles

IMDb WorldBook

(http://...linkedmdb.../Shining,releaseDate, 23 May 1980)(http://...linkedmdb.../Shining, filmLocation, http://cia.../UK)(http://...linkedmdb.../29704,actedIn, http://...linkedmdb.../Shining)

...

(http://cia.../UK, hasPopulation, 63230000)...

Shi

ning

UK

Data model: RDFGlobal identifier: URIAccess mechanism: HTTPConnection: data links

18Inria/2014-10-01

RDF Example InstancePrefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/

bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/

Subject Predicate Object

mdb: film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”’mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn

URI Literal

URI

21Inria/2014-10-01

RDF Graph

mdb:film/2014

“1980-05-23”

movie:initial release date

“The Shining”refs:label

bm:books/0743424425

4.7

rev:rating

bm:offers/0743424425amazonOffer

geo:2635167

“United Kingdom”

gn:name

62348447

gn:population

mdb:actor/29704


movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267

“The Last Tycoon”

refs:label

mdb:director/8476


movie:director name

mdb:film/2685

“A Clockwork Orange”

refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer

foaf:based nearmovie:actor

movie:directormovie:actor

movie:actor movie:actor

movie:director movie:director

22Inria/2014-10-01

Linked Data Model [Hartig, 2012]

Web Document

Given a countably infinite set D (documents), a Web of LinkedData is a tuple W = (D, adoc, data) where:

I D ⊆ D,

I adoc is a partial mapping from URIs to D, and

I data is a total mapping from D to finite sets of RDF triples.

23Inria/2014-10-01

Linked Data Model [Hartig, 2012]

Web Document

Given a countably infinite set D (documents), a Web of LinkedData is a tuple W = (D, adoc, data) where:

I D ⊆ D,

I adoc is a partial mapping from URIs to D, and

I data is a total mapping from D to finite sets of RDF triples.

Web of Linked Data

A Web of Linked Data W = (D, adoc, data)contains a data link from document d ∈ D todocument d ′ ∈ D if there exists a URI u suchthat:

I u is mentioned in an RDF triplet ∈ data(d), and

I d ′ = adoc(u).23Inria/2014-10-01

SPARQL Queries

SELECT ?nameWHERE {

?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )

}

?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook


movie:director name

?rrev:rating

FILTER(?r > 4.0)

25Inria/2014-10-01

Outline





5 Conclusions

26Inria/2014-10-01

Naıve Triple Store Design

SELECT ?nameWHERE {


}Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn

Easy to implementbut

too many self-joins!

27Inria/2014-10-01

Naıve Triple Store Design

SELECT ?nameWHERE {


}Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn

SELECT T1 . o b j e c tFROM T as T1 , T as T2 , T as T3 ,

T as T4 , T as T5WHERE T1 . p=” r d f s : l a b e l ”AND T2 . p=” movie : r e l a t e d B o o k ”AND T3 . p=” movie : d i r e c t o r ”AND T4 . p=” r e v : r a t i n g ”AND T5 . p=” movie : d i r e c t o r n a m e ”AND T1 . s=T2 . sAND T1 . s=T3 . sAND T2 . o=T4 . sAND T3 . o=T5 . sAND T4 . o > 4 . 0AND T5 . o=” S t a n l e y K u b r i c k ”

Easy to implementbut

too many self-joins!

27Inria/2014-10-01

Existing Solutions

1. Exhaustive indexing

Create indexes for each permutation of the three columnsQuery components become range queries over individualrelations with merge-join to combineExcessive space usage

2. Property table

Each class of objects go to a different table ⇒ similar tonormalized relationsEliminates some of the joins

3. Binary (vertically partitioned) tables

For each property, build a two-column table, containing bothsubject and object, ordered by subjectsCan use merge join (faster)Good for subject-subject joins but does not help withsubject-object joins

28Inria/2014-10-01

Graph-based Approach

Answering SPARQL query ≡ subgraph matching

gStore [Zou et al., 2011, 2014], chameleon-db [Aluc et al.,2013]

?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook


movie:director name

?rrev:rating

FILTER(?r > 4.0)

mdb:film/2014

“1980-05-23”



bm:books/0743424425

4.7

rev:rating


geo:2635167


gn:name

62348447

gn:population

mdb:actor/29704


movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267


refs:label

mdb:director/8476


movie:director name

mdb:film/2685


refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer





SubgraphM

atching

29Inria/2014-10-01




?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook


movie:director name

?rrev:rating

FILTER(?r > 4.0)

mdb:film/2014

“1980-05-23”



bm:books/0743424425

4.7

rev:rating


geo:2635167


gn:name

62348447

gn:population

mdb:actor/29704


movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267


refs:label

mdb:director/8476


movie:director name

mdb:film/2685


refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer





SubgraphM

atching

Advantages

I Maintains the graph structure

I Full set of queries can be handled

29Inria/2014-10-01




?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook


movie:director name

?rrev:rating

FILTER(?r > 4.0)

mdb:film/2014

“1980-05-23”



bm:books/0743424425

4.7

rev:rating


geo:2635167


gn:name

62348447

gn:population

mdb:actor/29704


movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267


refs:label

mdb:director/8476


movie:director name

mdb:film/2685


refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer





SubgraphM

atching

Advantages

I Maintains the graph structure

I Full set of queries can be handled

Disadvantages

I Graph pattern matching is expensive

29Inria/2014-10-01

Two Systems

gStore

mdb:film/2014

bm:books/0743424425

mdb:director/8476

mdb:film/424mdb:film/2685

mdb:actor/29804

mdb:film/3418 mdb:film/1267

mdb:actor/30013

movie:ac

tor

moive:director

“Spartacus”moive:director moive:director

“Jack_Nicholson”


rdfs:label

“1980-05-23”

rdfs:label

moive:actor_name

y:hasBudget

y:has_box_office

“22000000#dollar”

movie:relatedBook4.7

rev:rating

bm:offers/0743424425

scam:hasOffer

y:hasBudget y:hasBudget

“21000000#dollar” “26589355#dollar” “12000000#dollar” “60000000#dollar”

y:has_box_office

movie:actor


“The Passager” “The last Tycoon”rdfs:label rdfs:label

“Scatman Crothers”

movie:initial_release_datemoive:actor_name

Fig. 2. An RDF graph G

?x

?y

?z

mdb:movierdf:type

moive:director

“*Jack*”moive:actor_name

y:hasBudget

?budget<30000000Desc, top10

movie:actor

SELECT ?x ?y WHERE{ ?x hasBudget ?budget. ?x rdf:type mdb:movie. ?x movie:director ?y. ?y movie:actor_name ?z. FILTER( regex(str(?z),``Jack'') AND (?budget <30000000) )}ORDER BY ?budgetLIMIT 10

Fig. 3. SPARQL and Query Graph Q

a query signature graph Q⇤, the encoding strategy is analogueto encoding RDF graphs.

The online query evaluation process consists of two steps:filtering and joining. First, we generate the candidates for eachquery node using VS⇤-tree. Then, applying a depth-first searchstrategy, we perform the multi-way join over these candidatelists to find the subgraph matches of SPARQL query Q overRDF graph G.

III. Techniques

In this section, we briefly discuss the techniques used ingStore system; full details are given in elsewhere [5], [6]. Ac-cording to our framework in Section II, we solve the SPARQLquery processing by subgraph matching over the signaturegraph. A key issue is that the proposed encoding and pruningstrategies should support, in a uniform manner, di↵erent kindsof data (such as strings and numeric data), and SPARQLqueries with di↵erent operators . We discuss the encoding andpruning methods in Section III-A. Another technical issue isthe index structure, which is discussed in Section III-B. Wealso present some system-oriented optimization, such as indexcaching strategy and multicore-based query optimization inour system.

A. Encoding Techniques

In gSore, answering SPARQL queries is equivalent tofinding subgraph matches of query graph Q over RDF graphG. If vertex v (in query Q) can match vertex u (in RDF graphG), each neighbor vertex and each adjacent edge of v shouldmatch to some neighbor vertex and some adjacent edge of u.Thus, given a vertex u in G, we encode each of its adjacentedge labels and the corresponding neighbor vertex labels into

System Architecture

Offline Online

Storage

Input Input

RDF Parser

RDF Graph Builder

Encoding Module

VS*-tree builder

RDF data

RDF Triples

RDF Graph

Signature Graph

Key-Value Store

VS*-treeStore

SPARQL Parser

SPARQL Query

Encoding Module

VS*-tree

Query Graph

Filter Module

Join Module

Signature Graph

Node Candidate

Results

Fig. 4. System Architecture

bitstrings, denoted as vS ig(u). We encode query Q with thesame encoding method. Consequently, the match between Qand G can be verified by simply checking the match betweencorresponding encoded bitstrings.

Given a vertex u, we encode each of its adjacent edgese(eLabel, nLabel) into a bitstring, where eLabel is the edgelabel and nLabel is the vertex label. This bitstring is callededge signature (i.e., eS ig(e)). It has two parts: eS ig(e).e,eS ig(e).n. The first part eS ig(e).e (M bits) denotes the edgelabel (i.e., eLabel) and the second part eS ig(e).n (N bits)denotes the neighbor vertex label (i.e., nLabel). The code ofvS ig(u) is formed by performing OR operator over all eS ig(e).Figure 5 illustrates the process.

mdb:film/2014

mdb:director/8476

mdb:actor/29804 moive:director

“1980-05-23”

y:hasBudget

“22000000#dollar”

movie:initial_release_date

movie:actor

e1 rdfs:label "The Shining"e2 movie:initial_release_date "1980-05-23"e3 movie:director mdb:director/8476e4 movie:actor mdb:actor/29704e5 movie:actor mdb:actor/30013e6 y:hasDuration 7140.0$#se7 y:hasBudget 22000000$#$e8 y:hasImdb "0081505"rdfs:label"The Shining"

hasDuration

hasDuration

"0081505"

y:hasImdb

eSig.e eSig.ne1 001000010 000010000101000e2 000110000 000000011100000e3 100100000 000010010000001e4 000010010 001001000000001e5 000010010 001001010000000e6 101000000 000001001100000e7 001010000 000010000001001e8 100010000 001000001001000

nSig 101110010 001011011101001

Fig. 5. Encoding Technique

1) Computing eS ig(e).e: Given an RDF repository, let |P|denote the number of di↵erent properties. If |P| is small, weset |eS ig(e).e| = |P|, where |eS ig(e).e| denotes the length ofthe bitstring, and build a 1-to-1 mapping between the propertyand the bit position. If |P| is large, we resort to the hashingtechnique. Let |eS ig(e).e| = M. Using an appropriate hashfunction, we set m out of M bits in eS ig(e).e to be ‘1’.

chameleon-db

Structural Index

...

Vertex Index

Spill Index

Clu

ster

Ind

ex

Sto

rag

eS

yst

em Sto

rag

eA

dvis

or

QueryEngine Plan Generation Evaluation

31Inria/2014-10-01

gStore

12,000 lines of C++ code under Linux(plus code for SPARQL parser)

mdb:film/2014

bm:books/0743424425

mdb:director/8476

mdb:film/424mdb:film/2685

mdb:actor/29804

mdb:film/3418 mdb:film/1267

mdb:actor/30013

movie:ac

tor

moive:director

“Spartacus”moive:director moive:director

“Jack_Nicholson”


rdfs:label

“1980-05-23”

rdfs:label

moive:actor_name

y:hasBudget

y:has_box_office

“22000000#dollar”

movie:relatedBook4.7

rev:rating

bm:offers/0743424425

scam:hasOffer

y:hasBudget y:hasBudget

“21000000#dollar” “26589355#dollar” “12000000#dollar” “60000000#dollar”

y:has_box_office

movie:actor


“The Passager” “The last Tycoon”rdfs:label rdfs:label

“Scatman Crothers”

movie:initial_release_datemoive:actor_name

Fig. 2. An RDF graph G

?x

?y

?z

mdb:movierdf:type

moive:director

“*Jack*”moive:actor_name

y:hasBudget

?budget<30000000Desc, top10

movie:actor

SELECT ?x ?y WHERE{ ?x hasBudget ?budget. ?x rdf:type mdb:movie. ?x movie:director ?y. ?y movie:actor_name ?z. FILTER( regex(str(?z),``Jack'') AND (?budget <30000000) )}ORDER BY ?budgetLIMIT 10

Fig. 3. SPARQL and Query Graph Q

a query signature graph Q⇤, the encoding strategy is analogueto encoding RDF graphs.

The online query evaluation process consists of two steps:filtering and joining. First, we generate the candidates for eachquery node using VS⇤-tree. Then, applying a depth-first searchstrategy, we perform the multi-way join over these candidatelists to find the subgraph matches of SPARQL query Q overRDF graph G.

III. Techniques

In this section, we briefly discuss the techniques used ingStore system; full details are given in elsewhere [5], [6]. Ac-cording to our framework in Section II, we solve the SPARQLquery processing by subgraph matching over the signaturegraph. A key issue is that the proposed encoding and pruningstrategies should support, in a uniform manner, di↵erent kindsof data (such as strings and numeric data), and SPARQLqueries with di↵erent operators . We discuss the encoding andpruning methods in Section III-A. Another technical issue isthe index structure, which is discussed in Section III-B. Wealso present some system-oriented optimization, such as indexcaching strategy and multicore-based query optimization inour system.

A. Encoding Techniques

In gSore, answering SPARQL queries is equivalent tofinding subgraph matches of query graph Q over RDF graphG. If vertex v (in query Q) can match vertex u (in RDF graphG), each neighbor vertex and each adjacent edge of v shouldmatch to some neighbor vertex and some adjacent edge of u.Thus, given a vertex u in G, we encode each of its adjacentedge labels and the corresponding neighbor vertex labels into

System Architecture

Offline Online

Storage

Input Input

RDF Parser

RDF Graph Builder

Encoding Module

VS*-tree builder

RDF data

RDF Triples

RDF Graph

Signature Graph

Key-Value Store

VS*-treeStore

SPARQL Parser

SPARQL Query

Encoding Module

VS*-tree

Query Graph

Filter Module

Join Module

Signature Graph

Node Candidate

Results

Fig. 4. System Architecture

bitstrings, denoted as vS ig(u). We encode query Q with thesame encoding method. Consequently, the match between Qand G can be verified by simply checking the match betweencorresponding encoded bitstrings.

Given a vertex u, we encode each of its adjacent edgese(eLabel, nLabel) into a bitstring, where eLabel is the edgelabel and nLabel is the vertex label. This bitstring is callededge signature (i.e., eS ig(e)). It has two parts: eS ig(e).e,eS ig(e).n. The first part eS ig(e).e (M bits) denotes the edgelabel (i.e., eLabel) and the second part eS ig(e).n (N bits)denotes the neighbor vertex label (i.e., nLabel). The code ofvS ig(u) is formed by performing OR operator over all eS ig(e).Figure 5 illustrates the process.

mdb:film/2014

mdb:director/8476

mdb:actor/29804 moive:director

“1980-05-23”

y:hasBudget

“22000000#dollar”

movie:initial_release_date

movie:actor

e1 rdfs:label "The Shining"e2 movie:initial_release_date "1980-05-23"e3 movie:director mdb:director/8476e4 movie:actor mdb:actor/29704e5 movie:actor mdb:actor/30013e6 y:hasDuration 7140.0$#se7 y:hasBudget 22000000$#$e8 y:hasImdb "0081505"rdfs:label"The Shining"

hasDuration

hasDuration

"0081505"

y:hasImdb

eSig.e eSig.ne1 001000010 000010000101000e2 000110000 000000011100000e3 100100000 000010010000001e4 000010010 001001000000001e5 000010010 001001010000000e6 101000000 000001001100000e7 001010000 000010000001001e8 100010000 001000001001000

nSig 101110010 001011011101001

Fig. 5. Encoding Technique

1) Computing eS ig(e).e: Given an RDF repository, let |P|denote the number of di↵erent properties. If |P| is small, weset |eS ig(e).e| = |P|, where |eS ig(e).e| denotes the length ofthe bitstring, and build a 1-to-1 mapping between the propertyand the bit position. If |P| is large, we resort to the hashingtechnique. Let |eS ig(e).e| = M. Using an appropriate hashfunction, we set m out of M bits in eS ig(e).e to be ‘1’.

General Approach:

Work directly on the RDF graphand the SPARQL query graph

Use a signature-based encoding ofeach entity and class vertex tospeed up matching

Filter-and-evaluate

Use a false positive algorithmto prune nodes and obtain aset of candidates; then domore detailed evaluation onthose

Use an index (VS∗-tree) over thedata signature graph (has lightmaintenance load) for efficientpruning

32Inria/2014-10-01

1. Encode Q and G to Get Signature GraphsQuery signature graph Q∗

0100 0000 1000 000000010

0000 010010000

Data signature graph G∗

0010 1000

0100 0001

00001

1000 000100010

0000 0100

10000

0000 1000

10000

0000 0010

10000

0000 1001

00100

0001 000101000

0100 1000

01000

1001 1000

01000

0001 0100

01000

33Inria/2014-10-01

2. Filter-and-EvaluateQuery signature graph Q∗

0100 0000 1000 000000010

0000 010010000

Data signature graph G∗

0010 1000

0100 0001

00001

1000 000100010

0000 0100

10000

0000 1000

10000

0000 0010

10000

0000 1001

00100

0001 000101000

0100 1000

01000

1001 1000

01000

0001 0100

01000

Find matches of Q∗ oversignature graph G ∗

Verify each match inRDF graph G

34Inria/2014-10-01

How to Generate Candidate List

Two step process:1. For each node of Q∗ get lists of nodes in G∗ that include that

node.2. Do a multi-way join to get the candidate list

Alternatives:

Sequential scan of G∗

Both steps are inefficient

Use S-treesHeight-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists ofnodes in G∗ that include that node.

• Given query signature q and a set of data signatures S ,find all data signatures si ∈ S where q&si = q

Does not support second step – expensive

VS-tree (and VS∗-tree)Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices

35Inria/2014-10-01

How to Generate Candidate List

Two step process:1. For each node of Q∗ get lists of nodes in G∗ that include that

node.2. Do a multi-way join to get the candidate list

Alternatives:Sequential scan of G∗

Both steps are inefficient

Use S-treesHeight-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists ofnodes in G∗ that include that node.

• Given query signature q and a set of data signatures S ,find all data signatures si ∈ S where q&si = q

Does not support second step – expensive

VS-tree (and VS∗-tree)Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices

35Inria/2014-10-01

S-tree Solution

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

1000 00000100 000000010

0000 010010000

Possibly large join space!

36Inria/2014-10-01

S-tree Solution

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

1000 00000100 000000010

0000 010010000 002

011


36Inria/2014-10-01

S-tree Solution

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

1000 00000100 000000010

0000 010010000 002

011

003

008


36Inria/2014-10-01

S-tree Solution

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

1000 00000100 000000010

0000 010010000 002

011

003

008

004

009


36Inria/2014-10-01

S-tree Solution

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

1000 00000100 000000010

0000 010010000 002

011

003

008

004

009on on


36Inria/2014-10-01

VS-tree

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

11101

1001010001 01100

10000 00001 01100

00010

10000

01000

01000

10000

10000

10000

1000000010

00100

01000

01000

01000

01000

Super edge

37Inria/2014-10-01

Pruning with VS-Tree

1111 1111

0110 1111 1101 1101

0000 1110 0110 1001 1100 1001 1001 1101

0000 1000

0000 0100 0000 0010

0010 1000

0100 0001

1000 0001

0000 1001

0100 1000

1001 1000

0001 0100

0001 0001

005

004 006

001

002

003

007

011

008

009

010

d11

d21 d2

2

d31 d3

2 d33 d3

4

G 3

G 2

G 1

11101

1001010001 01100

10000 00001 01100

00010

10000

01000

01000

10000

10000

10000

1000000010

00100

01000

01000

01000

01000

1000 00000100 000000010

0000 010010000

d32

d33

d33

d34

d31

d34

G 3

00010 10000

01000

003

008

002

011

004

009onon

Reduced join space!

38Inria/2014-10-01

Adaptivity to Workload

Web applications that are supported by RDF datamanagement systems are far more varied than conventionalrelational applications

Data that are being handled are far more heterogeneous

SPARQL is far more flexible in how triple patterns (i.e., theatomic query unit) can be combined

An experiment [Aluc et al., 2014a]

RDF-3X VOS (6.1) VOS (7.1) MonetDB 4Store% queries for whichtested system isfastest

20.9 0.0 22.6 56.5 0.0

Total workload exe-cution time (hours)

27.1 20.9 20.8 38.6 72.2

Mean (per query)execution time (sec-onds)

7.8 6.0 6.0 11.1 20.7

39Inria/2014-10-01

Adaptivity to Workload

Web applications that are supported by RDF datamanagement systems are far more varied than conventionalrelational applications

Data that are being handled are far more heterogeneous

SPARQL is far more flexible in how triple patterns (i.e., theatomic query unit) can be combined

An experiment [Aluc et al., 2014a]

RDF-3X VOS (6.1) VOS (7.1) MonetDB 4Store% queries for whichtested system isfastest

20.9 0.0 22.6 56.5 0.0

Total workload exe-cution time (hours)

27.1 20.9 20.8 38.6 72.2

Mean (per query)execution time (sec-onds)

7.8 6.0 6.0 11.1 20.7

Summary of Experiments

I No single system is a sole winner across all queries

I No single system is the sole loser across all queries, either

I There can be 2–5 orders of magnitude difference in the performance (i.e., queryexecution time) between the best and the worst system for a given query

I The winner in one query may timeout in another

I Performance difference widens as dataset size increases

39Inria/2014-10-01

Group-by-Query Approach

Tamer Post23571hasPost

OlaftaggedIn

UWaterlooworksAt

Tamer Post23hasPost

Boblikes

UWaterlooworksAt

Post2hasPost taggedIn

Tamer Post23hasPost

BobtaggedIn

UWaterlooworksAt

Post2hasPost favourites

40Inria/2014-10-01

Challenges

Group-by-query clusters (a) do not have fixed size, (b) containsame set of attributes

1. Workload time analysis

2. Updating the physical layout

3. Partial indexing

Type-A,robust

Type-C,robust

Type-A,adaptable

Type-B,adaptable

Type-B,

adaptable

Type-B,

adaptable

Type-B,adaptable T

ype-C,

adaptable

41Inria/2014-10-01

Challenges




3. Partial indexing

Storage System

CacheHash

Function

evict

@t1

· · ·

functionadapts

HashFunction

@tk

41Inria/2014-10-01

Challenges




3. Partial indexing

Storage System

CacheHash

Function

evict

@t1

· · ·

functionadapts

HashFunction

@tk

Index – – – – – – – – – –

SPARQL Query Engine

41Inria/2014-10-01

chameleon-dbPrototype system [Aluc et al., 2013]35,000 lines of code in C++ under Linux (plus code forSPARQL 1.0 parser)

Structural Index

...

Vertex Index

Spill Index

Clu

ster

Inde

xS

tora

geS

yste

m Sto

rage

Adv

isor

QueryEngine Plan Generation Evaluation

42Inria/2014-10-01

Outline





5 Conclusions

45Inria/2014-10-01

Remember the Environment

Distributed environment

Some of the data sites canprocess SPARQL queries –SPARQL endpoints

Not all data sites canprocess queries

Alternatives

Data re-distribution +query decompositionSPARQL federation: justprocess at SPARQLendpointsLive querying (see nextsection)

46Inria/2014-10-01





Alternatives

Data re-distribution +query decomposition

SPARQL federation: justprocess at SPARQLendpointsLive querying (see nextsection)

46Inria/2014-10-01





Alternatives

Data re-distribution +query decompositionSPARQL federation: justprocess at SPARQLendpoints

Live querying (see nextsection)

46Inria/2014-10-01





Alternatives

Data re-distribution +query decompositionSPARQL federation: justprocess at SPARQLendpointsLive querying (see nextsection)

46Inria/2014-10-01

Distributed RDF Processing [Kaoudi and Manolescu, 2014]

Data partitioning approachesRDF data warehouse is partitioned and distributed

RDF data D = {D1, . . . ,Dn}Allocate each Di to a site

Partitioning alternativesTable-based (e.g., [Husain et al., 2011])Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])

SPARQL query decomposed Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} over {D1, . . . ,Dn}

I High performance

I Great for parallelizing centralized RDF data

I May not be possible to re-partition and re-allocate Web data(i.e., LOD)

47Inria/2014-10-01

SPARQL Endpoint Federation

Consider only the SPARQL endpoints for query execution

No data re-partitioning/re-distribution

Consider D = D1 ∪ D2 ∪ . . . ∪ Dn; Di : SPARQL endpoint

AlternativesSPARQL query decomposed Q = {Q1, . . . ,Qk} and executedover {D1, . . . ,Dn} – DARQ, FedX [Schwarte et al., 2011],SPLENDID [Gorlitz and Staab, 2011], ANAPSID [Acostaet al., 2011]Partial query evaluation – Distributed gStore

Partial evaluation

I Given function f (s, d) and part of its input s, perform f ’scomputation that only depends on s to get f ′(d)

I Compute f ′(d) when d becomes available

I Applied to, e.g., XML [Buneman et al., 2006]

49Inria/2014-10-01

Distributed SPARQL Using Partial Query EvaluationTwo steps:

1. Evaluate a query at each site to find local matchesQuery is the function and each Di is the known inputInner match or local partial match

2. Assemble the partial matches to get final resultCrossing matchCentralized assemblyDistributed assembly

D1

D2

D3

D4

Crossing match

50Inria/2014-10-01

Outline





5 Conclusions

52Inria/2014-10-01

Live Query Processing

Not all data resides atSPARQL endpoints

Freshness of access to dataimportant

Potentially countably infinitedata sources

Live querying

On-line executionOnly rely on linked dataprinciples

Alternatives

Traversal-basedapproachesIndex-based approachesHybrid approaches

53Inria/2014-10-01

SPARQL Query Semantics in Live Querying

Full-web semantics

Scope of evaluating a SPARQL expression is all Linked DataQuery result completeness cannot be guaranteed by any(terminating) execution

Reachability-based query semantics

Query consists of a SPARQL expression, a set of seed URIs S ,and a reachability condition cScope: all data along paths of data links that satisfy theconditionComputationally feasible

54Inria/2014-10-01

Traversal Approaches

Discover relevant URIs recursivelyby traversing (specific) data linksat query execution runtime [Hartig,2013; Ladwig and Tran, 2011]

Implements reachability-basedquery semantics

Start from a set of seed URIsRecursively follow and discovernew URIs

Important issue is selection of seedURIs

Retrieved data serves to discovernew URIs and to construct result

55Inria/2014-10-01







Advantages

Easy to implementNo data structure to maintain

55Inria/2014-10-01







Advantages

Easy to implementNo data structure to maintain

Disadvantages

Possibilities for parallelized data retrieval are limited (can do asmuch as parallel crawling)Repeated data retrieval introduces significant query latency

55Inria/2014-10-01

Index Approaches

Use pre-populated index to determine relevant URIs (and toavoid as many irrelevant ones as possible)

Different index keys possible; e.g., triple patterns [Umbrichet al., 2011]

Index entries a set of URIsIndexed URIs may appear multiple times (i.e., associated withmultiple index keys)Each URI in such an entry may be paired with a cardinality(utilized for source ranking)

Key: tp Entry: {uri1, uri2, , urin}

GET urii

57Inria/2014-10-01

Index Approaches





GET urii

Advantages

Data retrieval can be fully parallelizedReduces the impact of data retrieval on query execution time

57Inria/2014-10-01

Index Approaches





GET urii

Advantages

Data retrieval can be fully parallelizedReduces the impact of data retrieval on query execution time

Disadvantages

Querying can only start after index constructionDepends on what has been selected for the indexFreshness may be an issueIndex maintenance

57Inria/2014-10-01

Hybrid Approach

Perform a traversal-based execution using a prioritized list ofURIs to look up [Ladwig and Tran, 2010]

Initial seed from the pre-populated index

Non-seed URIs are ranked by a function based on informationin the index

New discovered URIs that are not in the index are rankedaccording to number of referring documents

58Inria/2014-10-01

Outline





5 Conclusions

60Inria/2014-10-01

Conclusions

RDF and Linked Object Data seem to have considerablepromise for Web data management

More work needs to be done

Query semanticsAdaptive system designOptimizations – both in data warehousing and distributedenvironmentsLive querying requires significant thought to reduce latency

2014 2011

61Inria/2014-10-01

Conclusions

What I did not talk about:

Not much on general distributed/parallel processing

Not much on SPARQL semantics

Nothing about RDFS – no schema stuff

Nothing about entailment regimes > 0⇒ no reasoning

62Inria/2014-10-01

Thank you!

Research supported by

63Inria/2014-10-01

References I

Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). TheLorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68–88.

Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011).ANAPSID: an adaptive query processing engine for SPARQL endpoints. InProc. 10th Int. Semantic Web Conf., pages 18–34.

Aluc, G., Hartig, O., Ozsu, M. T., and Daudjee, K. (2014a). Diversified stresstesting of RDF data management systems. In Proc. 13th Int. Semantic WebConf. Forthcoming.

Aluc, G., Ozsu, M. T., and Daudjee, K. (2014b). Workload matters: Why RDFdatabases need a new design. Proc. VLDB Endowment, 7(10):837–840.

Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: aworkload-aware robust RDF data management system. Technical ReportCS-2013-10, University of Waterloo.

Arocena, G. and Mendelzon, A. (1998). Weboql: Restructuring documents,databases and webs. In Proc. 14th Int. Conf. on Data Engineering, pages24–33.

64Inria/2014-10-01

References IIAtre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix “bit”

loaded: A scalable lightweight join query processor for rdf data. In Proc.19th Int. World Wide Web Conf., pages 41–50.

Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partialevaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on VeryLarge Data Bases, pages 211–222.

Buneman, P., Davidson, S., Hillebrand, G. G., and Suciu, D. (1996). A querylanguage and optimization techniques for unstructured data. In Proc. ACMSIGMOD Int. Conf. on Management of Data, pages 505–516.

Fernandez, M., Florescu, D., and Levy, A. (1997). A query language for aweb-site management system. ACM SIGMOD Rec., 26(3):4–11.

Gorlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federationexploiting VOID descriptions. In Proc. 2nd Int. Workshop on ConsumingLinked Data.

Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: Adistributed shared-nothing RDF engine based on asynchronous messagepassing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages289–300.

65Inria/2014-10-01

References IIIHartig, O. (2012). SPARQL for a web of linked data: Semantics and

computability. In Proc. 9th Extended Semantic Web Conf., pages 8–23.

Hartig, O. (2013). SQUIN: a traversal based query execution system for theweb of linked data. In Proc. ACM SIGMOD Int. Conf. on Management ofData, pages 1081–1084.

Hartig, O. and Ozsu, M. T. (2014). Optimizing response time oftraversal-based query optimization. In preparation.

Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying oflarge RDF graphs. Proc. VLDB Endowment, 4(11):1123–1134.

Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham,B. (2011). Heuristics-based query processing for large RDF graphs usingcloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.

Kaoudi, Z. and Manolescu, I. (2014). RDF in the clouds: A survey. VLDB J.Forthcoming.

Konopnicki, D. and Shmueli, O. (1995). W3QS: A query system for the WorldWide Web. In Proc. 21th Int. Conf. on Very Large Data Bases, pages 54–65.

66Inria/2014-10-01

References IVLadwig, G. and Tran, T. (2010). Linked data query processing strategies. In

Proc. 9th Int. Semantic Web Conf., pages 453–469.

Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linkeddata. In Proc. 8th Extended Semantic Web Conf., pages 139–153.

Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). A declarativelanguage for querying and restructuring the Web. In Proc. 6th Int.Workshop on Research Issues on Data Eng., pages 12–21.

Lee, K. and Liu, L. (2013). Scaling queries over big rdf graphs with semantichash partitioning. Proc. VLDB Endowment, 6(14):1894–1905.

Mendelzon, A. O., Mihaila, G. A., and Milo, T. (1997). Querying the WorldWide Web. Int. J. Digit. Libr., 1(1):54–67.

Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995). Objectexchange across heterogeneous information sources. In Proc. 11th Int. Conf.on Data Engineering, pages 251–260.

Peng, P., Zou, L., Ozsu, M. T., Chen, L., and Zhao, D. (2014). ProcessingSPARQL queries over linked data – a distributed graph-based approach. Insubmitted for publication.

67Inria/2014-10-01

References VPrasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query

processing for autonomous rdf databases. In Proc. 15th Int. Conf. onExtending Database Technology, pages 372–383.

Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011).Fedx: A federation layer for distributed query processing on linked opendata. In Proc. 8th Extended Semantic Web Conf., pages 481–486.

Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011).Comparing data summaries for processing live queries over linked data.World Wide Web J., 14(5-6):495–544.

Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towardsscalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29thInt. Conf. on Data Engineering, pages 565–576.

Zou, L., Mo, J., Chen, L., Ozsu, M. T., and Zhao, D. (2011). gStore:answering SPARQL queries via subgraph matching. Proc. VLDBEndowment, 4(8):482–493.

Zou, L., Ozsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014).gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590.

68Inria/2014-10-01

Web Data Management in RDF Age

Technology

Transcript of Web Data Management in RDF Age