Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real)...

Assessing the performance of RDF Engines: Discussing RDF Benchmarks Irini Fundulaki Institute of Computer Science – FORTH, Greece Anastasios Kementsietsidis Google Research, USA 5/27/16 ESWC 2016: Assessing the performance of RDF Engines - Discussing RDF Benchmarks 1


Page 1: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Irini Fundulaki, Institute of Computer Science – FORTH, Greece

Anastasios Kementsietsidis, Google Research, USA

Page 2: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Traditional Web: Web of Documents
•  single information space: global file system
•  designed for human consumption
•  documents are the primary objects with a loose structure
•  URLs are the globally unique IDs and part of the retrieval mechanism
•  cannot ask expressive queries

[Figure: HTML documents connected by hyperlinks and accessed through Web browsers]
© Hartig, Cyganiak, Bizer, Hausenblas, Heath: How to Publish Linked Data on the Web

Page 3: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Going from the Web of Documents to the Web of Data
•  A global database
•  Designed for machines first, humans later
•  Things are primary objects with a well-defined structure
•  Typed links between things
•  Ability to express structured queries

Don't link the documents, link the things

[Figure: things connected by typed links]
© The Web of Linked Data: Tom Heath, An Introduction to Linked Data

Page 4: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Linking Open Datasets (LOD)
•  Publish open data as Linked Data on the Web
•  Interlink entities between heterogeneous data sources

Page 5: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Status of the Linked Open Data Cloud, 2007

Page 6: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Status of the Linked Open Data Cloud, 2011

Page 7: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Status of the Linked Open Data Cloud, 2014

[Figure: LOD cloud diagram; dataset categories: Media, Government, Geographic, Publications, User-generated, Life sciences, Cross-domain]

RDF, a common data model
More than 31B triples in LOD
Links (external): 500M

Page 8: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Linked Data in numbers (2014)
•  State of the LOD Cloud 2014, University of Mannheim

                                           Access methods
Domain                  Datasets  %        Any          SPARQL   Dump
Government              183       18.05    61 (32.80%)  30.11%   30.65%
Publications            96        9.47     10 (10.58%)  9.62%    3.85%
Life Sciences           83        8.19     19 (21.35%)  20.22%   16.85%
User-generated content  48        4.73     3 (5.45%)    5.45%    1.82%
Cross-domain            41        4.04     4 (9.09%)    4.55%    6.82%
Media                   22        2.17     1 (2.70%)    0.00%    2.70%
Geographic              21        2.07     8 (19.51%)   12.20%   12.20%
Social Web              520       51.28    6 (1.16%)    1.16%    0.39%
Total                   1014      -        48 (5.89%)   4.54%    3.80%

Page 9: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Proliferation of Big Data Stores

Page 10: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Many (not a lot) RDF Stores

Page 11: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

The Question(s)
•  Which are the problems that I wish to solve?
•  Which are the relevant key performance indicators?
•  What is the behavior of the existing engines w.r.t. the key performance indicators?
•  How can we push the systems to get better?

Which tool(s) should I use for my data and for my use case?

Page 12: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

The Answer: Benchmark your engines!
•  A querying benchmark comprises
   –  datasets (synthetic or real)
   –  a set of software tools
      •  synthetic data generators
      •  query generators
   –  performance metrics, and
   –  a set of clear execution rules
•  Standardized application scenario(s) that serve as a basis for testing systems
•  Must include a clear set of factors to be measured and the conditions under which the systems should be measured
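The components listed above (a workload, a metric and execution rules) can be pictured with a small harness. This is only a sketch under stated assumptions: `run_benchmark`, `toy_engine`, the warm-up and repetition counts and the choice of the median as metric are all illustrative and are not taken from any benchmark discussed in this tutorial.

```python
import statistics
import time


def run_benchmark(execute, queries, warmup_runs=1, timed_runs=3):
    """Return {query name: median wall-clock seconds} for each query.

    Execution rules assumed here: every query is first run `warmup_runs`
    times without being measured, then `timed_runs` times with measurement.
    """
    results = {}
    for name, query in queries.items():
        for _ in range(warmup_runs):
            execute(query)  # warm-up run, timing discarded
        timings = []
        for _ in range(timed_runs):
            start = time.perf_counter()
            execute(query)
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)  # the performance metric
    return results


def toy_engine(query):
    """Hypothetical stand-in for a real engine: the 'query' is just a work size."""
    return sum(range(query))


report = run_benchmark(toy_engine, {"q_small": 1_000, "q_large": 100_000})
for name, seconds in report.items():
    print(f"{name}: {seconds:.6f} s")
```

Fixing the warm-up and repetition policy in code is one way to make the execution rules explicit and repeatable across systems.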

Page 13: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Importance of Benchmarking
•  Benchmarks exist
   –  To allow adequate measurements of systems
   –  To provide evaluation of engines for real (or close to real) use cases
•  Provide help to
   –  Designers and developers, to assess the performance of their tools
   –  Users, to compare the different available tools and evaluate suitability for their needs
   –  Researchers, to compare their work to others
•  Lead to improvements:
   –  Vendors can improve their technology
   –  Researchers can address new challenges
   –  Current benchmark design can be improved to cover new necessities and application domains

Page 14: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Tutorial Objective & Benefits
•  Objectives:
   –  Discuss a set of principles and best practices for benchmark development
   –  Present an overview of the current work on benchmarks for RDF query engines
   –  Focus on identifying research challenges & unexplored research directions
   –  Interdisciplinary: Semantic Web & (a little bit of) Databases
•  Benefits for the audience
   –  Academic: obtain a solid background, discover new research directions
   –  Practitioner: find out which benchmarks are available, and their advantages and limitations

Page 15: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Purpose of the Tutorial
•  Stimulate discussions on the following topics:
   1.  How can one come up with the right benchmark that accurately captures use cases of interest?
   2.  How can a benchmark capture the fact that RDF data originate from a multitude of formats?
       •  Structured: relational and/or XML data to RDF
       •  Unstructured
   3.  How can a benchmark capture the different data and query patterns and provide a consistent picture for system behavior across different application settings?
   4.  How can one select the right benchmark for her system, data and workload?

Page 16: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Overview
•  Introducing Benchmarks
•  A short discussion about Linked Data
   –  Resource Description Framework (Data Model)
   –  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
   –  Synthetic
   –  Real
   –  Benchmark Generators
•  Sum up: what did we learn today?

Page 17: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

A short discussion about Linked Data
–  Resource Description Framework (Data Model)
–  SPARQL (Query Language)

Page 18: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Resource Description Framework (RDF)
•  W3C standard to represent Web data and metadata
•  generic and simple graph-based model
•  information from heterogeneous sources merges naturally:
   –  resources with the same URI denote the same non-information resource (leading to the Linked Data Cloud)
•  structure is added using schema languages and is represented as RDF triples
•  Web browsers use URIs to retrieve information

Page 19: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Resource Description Framework (RDF)
•  An RDF triple is of the form (s, p, o) where
   –  s is the subject: the URI identifying the described resource
   –  p is the predicate: the URI indicating the relation between subject and object
   –  o is the object: either a simple literal value or the URI of another resource
•  An RDF graph is a set of triples
   –  Can be viewed as a node- and edge-labeled directed graph
   –  It is published in different formats
      •  RDF/XML, Turtle, N3 triples, …

Example triple:
   (dbpedia:Good_Day_Sunshine, dbpedia-owl:artist, dbpedia:The_Beatles)

Close to how people see the world (as a graph)!
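To make the "set of triples viewed as a labeled directed graph" idea concrete, here is a minimal sketch using plain tuples. The first triple is the one from the slide; the second triple and the helper name `outgoing_edges` are assumptions added only for illustration.

```python
from collections import defaultdict

# An RDF graph modeled as a set of (subject, predicate, object) tuples.
triples = {
    ("dbpedia:Good_Day_Sunshine", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    # Hypothetical extra triple with a literal object, for illustration only.
    ("dbpedia:Good_Day_Sunshine", "rdfs:label", '"Good Day Sunshine"'),
}


def outgoing_edges(graph):
    """View the triple set as a directed graph: node -> [(edge label, target)]."""
    edges = defaultdict(list)
    for subject, predicate, obj in graph:
        edges[subject].append((predicate, obj))
    return {node: sorted(targets) for node, targets in edges.items()}


adjacency = outgoing_edges(triples)
for predicate, obj in adjacency["dbpedia:Good_Day_Sunshine"]:
    print(predicate, "->", obj)
```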

Page 20: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Adding Semantics to RDF
•  RDF is a generic, abstract data model for describing resources in the form of triples
•  RDF does not provide ways of defining classes, properties, constraints
•  W3C Standard Schema Languages
   –  RDF Vocabulary Description Language (RDF Schema, RDFS) to define schema vocabularies
   –  Web Ontology Language (OWL) to define ontologies

Page 21: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Adding Semantics to RDF
•  RDF Vocabularies are sets of terms used to describe notions in a domain of interest
•  An RDF term is either a Class or a Property
   –  Object properties denote relationships between objects
   –  Datatype properties denote attributes of resources
•  RDFS is designed to introduce useful semantics to RDF triples
•  RDFS schemas are represented as RDF triples

"An RDF Vocabulary is a schema comprising of classes, properties and relationships which can be used for describing data and metadata"

Page 22: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

RDF Vocabulary Description Language (RDFS)
•  Typing: defining classes, properties, instances
•  Relationships between classes and properties: subsumption
•  Constraints: domain and range of properties
•  Inference rules to entail new, inferred knowledge

      Subject                  Predicate         Object
t1    dbo:MusicalWork          rdfs:subClassOf   dbo:Album
t2    dbo:MusicalWork          rdfs:domain       dbo:artist
t3    dbo:MusicalWork          rdfs:range        dbo:march
t4    dbr:Seven_Seas_Of_Rye    rdf:type          dbo:MusicalWork
t5    dbo:Album                rdf:type          rdfs:Class

Page 23: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

RDFS Inference
•  Used to entail new information from what is explicitly stated in the dataset
   –  Transitive closure across class and property hierarchies
   –  Transitive closure along the type and class/property relations
•  Two ways to implement it: Forward & Backward Reasoning
   –  Forward Reasoning: closure is computed at loading time
   –  Backward Reasoning: closure is computed on the fly when needed

R1:  (P1, rdfs:subPropertyOf, P2), (P2, rdfs:subPropertyOf, P3)
     ------------------------------------------------------------
     (P1, rdfs:subPropertyOf, P3)

R2:  (C1, rdfs:subClassOf, C2), (C2, rdfs:subClassOf, C3)
     ------------------------------------------------------------
     (C1, rdfs:subClassOf, C3)

R3:  (C1, rdfs:subClassOf, C2), (r1, rdf:type, C1)
     ------------------------------------------------------------
     (r1, rdf:type, C2)

R4:  (P1, rdfs:subPropertyOf, P2), (r1, P1, r2)
     ------------------------------------------------------------
     (r1, P2, r2)
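The four inference rules of this slide, applied in the forward-reasoning style (closure computed up front), can be sketched as a naive fixpoint loop. The function name `rdfs_closure` and the `ex:` data are assumptions for illustration; real engines use far more efficient strategies than this quadratic scan.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Forward reasoning: apply the four rules until no new triple appears."""
    closure = set(triples)
    while True:
        new = set()
        for s1, p1, o1 in closure:
            for s2, p2, o2 in closure:
                # Transitivity of rdfs:subPropertyOf and rdfs:subClassOf.
                if p1 == p2 and p1 in (SUBCLASS, SUBPROP) and o1 == s2:
                    new.add((s1, p1, o2))
                # (C1 subClassOf C2), (r type C1)  =>  (r type C2)
                if p1 == SUBCLASS and p2 == TYPE and o2 == s1:
                    new.add((s2, TYPE, o1))
                # (P1 subPropertyOf P2), (r1 P1 r2)  =>  (r1 P2 r2)
                if p1 == SUBPROP and p2 == s1:
                    new.add((s2, o1, o2))
        if new <= closure:  # fixpoint reached
            return closure
        closure |= new


# Hypothetical example data (the ex: names are not from the slides).
explicit = {
    ("ex:Song", SUBCLASS, "ex:MusicalWork"),
    ("ex:MusicalWork", SUBCLASS, "ex:Work"),
    ("ex:Good_Day_Sunshine", TYPE, "ex:Song"),
    ("ex:composer", SUBPROP, "ex:contributor"),
    ("ex:Good_Day_Sunshine", "ex:composer", "ex:Paul"),
}
inferred = rdfs_closure(explicit) - explicit
for triple in sorted(inferred):
    print(triple)
```

A backward reasoner would instead leave the stored data untouched and expand each query at run time, trading load-time cost and storage for query-time work.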

Page 24: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

RDFS Inference
•  Transitive closure along the type and class/property relations

R3:  (C1, rdfs:subClassOf, C2), (r1, rdf:type, C1)
     ------------------------------------------------------------
     (r1, rdf:type, C2)

      Subject                  Predicate         Object
t1    dbo:MusicalWork          rdfs:subClassOf   dbo:Album
t2    dbo:MusicalWork          rdfs:domain       dbo:artist
t3    dbo:MusicalWork          rdfs:range        dbo:march
t4    dbr:Seven_Seas_Of_Rye    rdf:type          dbo:MusicalWork
t5    dbo:Album                rdf:type          rdfs:Class
t6    dbo:MusicalWork          rdf:type          rdfs:Class

Page 25: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

SPARQL: Querying RDF Data
•  SPARQL: W3C Standard Language for Querying Linked Data
•  SPARQL 1.0 (2008) only allows accessing the data (query)
•  SPARQL 1.1 (2013) introduces:
   –  Query extensions: aggregates, sub-queries, negation, expressions in the SELECT clause, property paths, assignment, short form for CONSTRUCT, expanded set of functions and operators
   –  Updates:
      •  Data management: Insert, Delete, Delete/Insert
      •  Graph management: Create, Load, Clear, Drop, Copy, Move, Add
   –  Federation extension: Service, values, service variables (informative)

Page 26: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

SPARQL Queries (1)
•  The building block is the Triple Pattern
   –  RDF triple with variables
•  Group Graph Patterns
   –  Built through inductive construction, combining smaller patterns into more complex ones using SPARQL operators
      •  Join: similar to relational join
      •  Union (UNION): similar to relational union
      •  Optional (OPTIONAL) operators on triple patterns: similar to relational left outer join (introduces negation in the language)
      •  Filtering conditions (FILTER)
      •  Patterns on Named Graphs

Page 27: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

SPARQL Queries (2)
•  Aggregates
   –  specify expressions over groups of solutions
   –  As in standard settings, used when the result is computed over a group of solutions rather than a single solution
      •  Example: average value of a set of values, sum of a set
   –  Aggregates defined in SPARQL 1.1 are COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT, and SAMPLE
   –  Solutions are grouped using the GROUP BY clause
   –  Pruning at group level is performed with the HAVING clause
•  Additional Features
   –  duplicate elimination (DISTINCT)
   –  ordering results (ORDER BY) with an optional LIMIT clause

Page 28: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

SPARQL Semantics
•  SPARQL semantics is based on Pattern Matching
   –  Queries describe subgraphs of the queried graph
   –  SPARQL graph patterns describe the subgraphs to match

Intuitively, a triple pattern denotes the triples in an RDF graph that are of a specific form:

   TP1 = (?album, dbpedia-owl:artist, dbpedia:The_Beatles)
         matches all albums of The Beatles

   TP2 = (dbpedia:The_Beatles, ?property, ?object)
         matches all information about The Beatles
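The matching of a single triple pattern can be sketched in a few lines. This is an illustrative toy, not how a SPARQL engine is implemented: the function names are assumptions, variables are recognized simply by a leading `?`, and all data triples other than the pattern constants are hypothetical.

```python
def is_variable(term):
    return term.startswith("?")


def match_pattern(pattern, graph):
    """Return one binding (variable -> term) per triple that matches the pattern."""
    solutions = []
    for triple in graph:
        binding = {}
        for pattern_term, data_term in zip(pattern, triple):
            if is_variable(pattern_term):
                # A variable repeated inside the pattern must bind consistently.
                if binding.setdefault(pattern_term, data_term) != data_term:
                    break
            elif pattern_term != data_term:
                break  # a constant in the pattern does not match this triple
        else:
            solutions.append(binding)
    return solutions


# Hypothetical data, for illustration only.
graph = [
    ("dbpedia:Revolver", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    ("dbpedia:Help!", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    ("dbpedia:The_Beatles", "dbpedia-owl:genre", "dbpedia:Rock_music"),
    ("dbpedia:Starman", "dbpedia-owl:artist", "dbpedia:David_Bowie"),
]

tp1 = ("?album", "dbpedia-owl:artist", "dbpedia:The_Beatles")
tp2 = ("dbpedia:The_Beatles", "?property", "?object")
print(match_pattern(tp1, graph))
print(match_pattern(tp2, graph))
```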

Page 29: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

SPARQL Types of Queries
•  SELECT returns an ordered multi-set of variable bindings
   –  Bindings: mappings of variables to RDF terms in the dataset
   –  SQL-like syntax:
         SELECT ?v1 ?v2 … WHERE GraphPattern
•  ASK checks whether a graph pattern has at least one solution; returns a Boolean value (true/false)
•  CONSTRUCT returns a new RDF graph as specified by the graph template of the CONSTRUCT clause, using the computed bindings from the query's WHERE clause
•  DESCRIBE returns the RDF graph containing the RDF data about the requested resource

Page 30: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Querying RDF Data with SPARQL (1)

Simple SELECT query
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title
    WHERE { <http://example.org/book/book1> dc:title ?title }

JOIN query
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox
    WHERE { ?x foaf:name ?name .
            ?x foaf:mbox ?mbox . }

OPTIONAL operator
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox
    WHERE { ?x foaf:name ?name .
            OPTIONAL { ?x foaf:mbox ?mbox } }

REGEX in FILTER
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title
    WHERE { ?x dc:title ?title .
            FILTER regex(?title, "^SPARQL") }

Page 31: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Querying RDF Data with SPARQL (2)

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX org: <http://example.com/ns#>
    CONSTRUCT { ?x foaf:name ?name }
    WHERE { ?x org:employeeName ?name }

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    ASK { ?x foaf:name "Alice" }

"Find the people who live in "Palo Alto" and have founded or are board members of companies in the software industry. For each such company, find the products that were developed by it, its revenue, and optionally its number of employees."

    SELECT * WHERE {
      ?x home "Palo Alto" .
      { ?x founder ?y } UNION { ?x member ?y }
      {
        ?y industry "Software" .
        ?z developer ?y .
        ?y revenue ?n .
        OPTIONAL { ?y employees ?m } .
      }
    }

SPARQL 1.1: SPARQL plus Aggregates, Sub-queries, Property paths, Negation and more!

Page 32: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Storing and Querying RDF data
•  Schema agnostic
   –  triples are stored in a large triple table whose attributes are (subject, predicate, object): "monolithic" triple stores
   –  But it can get a bit more efficient

Triple table
      Subject                  Predicate    Object
t1    dbr:Seven_Seas_Of_Rye    rdf:type     dbo:MusicalWork
t2    dbr:Starman_(song)       rdf:type     dbo:MusicalWork
t3    dbr:Seven_Seas_Of_Rye    dbo:artist   dbo:Queen

Dictionary
id    URI/Literal
1     dbr:Seven_Seas_Of_Rye
2     dbr:Starman_(song)
3     dbo:MusicalWork
4     dbo:Queen
5     dbo:artist
6     rdf:type

Encoded triple table
Subject    Predicate    Object
1          6            3
2          6            3
1          5            4

RDF-3X maintains 6 indexes, namely SPO, SOP, OSP, OPS, PSO, POS. To avoid storage overhead, the indexes are compressed [NW09].
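The dictionary encoding and the six index orders can be sketched as follows. This is a toy illustration, not RDF-3X's actual implementation: ids are assigned in first-seen order (so they differ from the ids on the slide), each index is just a sorted list, and index compression is left out entirely.

```python
def dictionary_encode(triples):
    """Replace every URI/literal by a small integer id (first-seen order)."""
    ids = {}
    encoded = []
    for triple in triples:
        encoded.append(tuple(ids.setdefault(term, len(ids) + 1) for term in triple))
    return ids, encoded


def build_index(encoded, order):
    """Sort the encoded triples by a permutation such as 'pos' (predicate, object, subject)."""
    position = {"s": 0, "p": 1, "o": 2}
    return sorted(tuple(t[position[c]] for c in order) for t in encoded)


triples = [
    ("dbr:Seven_Seas_Of_Rye", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Starman_(song)", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Seven_Seas_Of_Rye", "dbo:artist", "dbo:Queen"),
]
ids, encoded = dictionary_encode(triples)
indexes = {
    order: build_index(encoded, order)
    for order in ("spo", "sop", "osp", "ops", "pso", "pos")
}
print(encoded)
print(indexes["pos"])
```

With all six permutations available, any triple pattern with bound terms can be answered by a range scan over the index whose sort order starts with those terms.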

Page 33: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Storing and Querying RDF data
•  Schema aware:
   –  one table is created per property with subject and object attributes (Property Tables [Wilkinson06])

Triple table
Subject    Predicate    Object
ID1        type         BookType
ID1        title        “XYZ”
ID1        author       “Fox, Joe”
ID1        copyright    “2001”
ID2        type         CDType
ID2        title        “ABC”
ID2        artist       “Orr, Tim”
ID2        copyright    “1985”
ID2        language     “French”
ID3        type         BookType
ID3        title        “MNO”
ID3        language     “English”
ID4        type         DVDType
ID4        title        “DEF”
ID5        type         CDType
ID5        title        “GHI”
ID5        copyright    “1995”
ID6        type         BookType
ID6        copyright    “2004”

Clustered Property Table
Subject    Type        Title    copyright
ID1        BookType    “XYZ”    “2001”
ID2        CDType      “ABC”    “1985”
ID3        BookType    “MNO”    NULL
ID4        DVDType     “DEF”    NULL
ID5        CDType      “GHI”    “1995”
ID6        BookType    NULL     “2004”

Triples not covered by the clustered property table
Subject    Predicate    Object
ID1        author       “Fox, Joe”
ID2        artist       “Orr, Tim”
ID2        language     “French”
ID3        language     “English”

Property-class Table: BookType
Subject    Title    Author        copyright
ID1        “XYZ”    “Fox, Joe”    “2001”
ID3        “MNO”    NULL          NULL
ID6        NULL     NULL          “2004”

Property-class Table: CDType
Subject    Title    artist        copyright
ID2        “ABC”    “Orr, Tim”    “1985”
ID5        “GHI”    NULL          “1995”

Triples not covered by the property-class tables
Subject    Predicate    Object
ID2        language     “French”
ID3        language     “English”
ID4        type         DVDType
ID4        title        “DEF”

Multi-Value Property Table
Subject    Object
…          …
…          …
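How a clustered property table and its left-over triples come out of the flat triple table can be sketched as below. The function name and the single-valued assumption for the clustered properties are illustrative choices and are not taken from [Wilkinson06]; `None` plays the role of NULL.

```python
def clustered_property_table(triples, clustered):
    """Split triples into a wide table (one row per subject) and left-over triples."""
    rows = {}
    leftover = []
    for subject, predicate, obj in triples:
        if predicate in clustered:
            # Assumption: clustered properties are single-valued, so a second
            # value for the same subject and property would overwrite the first.
            row = rows.setdefault(subject, dict.fromkeys(clustered))
            row[predicate] = obj
        else:
            leftover.append((subject, predicate, obj))
    return rows, leftover


# A subset of the triples shown on the slide.
triples = [
    ("ID1", "type", "BookType"), ("ID1", "title", "XYZ"),
    ("ID1", "author", "Fox, Joe"), ("ID1", "copyright", "2001"),
    ("ID3", "type", "BookType"), ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
]
rows, leftover = clustered_property_table(triples, ("type", "title", "copyright"))
for subject, row in rows.items():
    print(subject, row)
print(leftover)
```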

Page 34: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Storing and Querying RDF data
•  Vertically partitioned RDF [AMM+07]

Triple table
Subject    Predicate    Object
ID1        type         BookType
ID1        title        “XYZ”
ID1        author       “Fox, Joe”
ID1        copyright    “2001”
ID2        type         CDType
ID2        title        “ABC”
ID2        artist       “Orr, Tim”
ID2        copyright    “1985”
ID2        language     “French”
ID3        type         BookType
ID3        title        “MNO”
ID3        language     “English”
ID4        type         DVDType
ID4        title        “DEF”
ID5        type         CDType
ID5        title        “GHI”
ID5        copyright    “1995”
ID6        type         BookType
ID6        copyright    “2004”

type
Subject    Object
ID1        BookType
ID2        CDType
ID3        BookType
ID4        DVDType
ID5        CDType
ID6        BookType

title
Subject    Object
ID1        “XYZ”
ID2        “ABC”
ID3        “MNO”
ID4        “DEF”
ID5        “GHI”

copyright
Subject    Object
ID1        “2001”
ID2        “1985”
ID5        “1995”
ID6        “2004”

artist
Subject    Object
ID2        “Orr, Tim”

author
Subject    Object
ID1        “Fox, Joe”

language
Subject    Object
ID2        “French”
ID3        “English”

To get the most out of this particular decomposition, a column-oriented DBMS is recommended.
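Vertical partitioning amounts to grouping the triple table by predicate, producing one two-column (subject, object) table per property. A minimal sketch, assuming in-memory lists instead of a column store and a hypothetical helper name:

```python
from collections import defaultdict


def vertically_partition(triples):
    """Build one (subject, object) table per predicate, sorted by subject."""
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return {predicate: sorted(rows) for predicate, rows in tables.items()}


# A subset of the triples shown on the slide.
triples = [
    ("ID1", "type", "BookType"), ("ID1", "title", "XYZ"),
    ("ID1", "author", "Fox, Joe"), ("ID1", "copyright", "2001"),
    ("ID2", "type", "CDType"), ("ID2", "title", "ABC"),
    ("ID2", "artist", "Orr, Tim"), ("ID2", "copyright", "1985"),
    ("ID2", "language", "French"),
    ("ID3", "type", "BookType"), ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
]
tables = vertically_partition(triples)
for predicate in sorted(tables):
    print(predicate, tables[predicate])
```

Keeping each table sorted by subject is what lets subject-subject joins between two properties run as cheap merge joins.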

Page 35: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Comparison of Storage Techniques [BDK+13]

[Figure: sample graph with nodes LarryPage, “1973”, Google, Internet, Software, Hardware, “MTV”, 50,000, Android and edge labels born, founder, HQ, employees, industry, developer]

Triple store
subject      predicate    object
LarryPage    born         “1973”
LarryPage    founder      Google
Google       HQ           “MTV”
Google       employees    50,000
Google       industry     Internet
Google       industry     Software
Google       industry     Hardware
Google       developer    Android

Type-oriented store
person       born      founder
LarryPage    “1973”    Google

company    HQ       employees
Google     “MTV”    50,000

subject    predicate    object
Google     industry     Internet
Google     industry     Software
Google     industry     Hardware

Predicate-oriented store
born
subject      object
LarryPage    “1973”

founder
subject      object
LarryPage    Google

HQ
subject    object
Google     “MTV”

employees
subject    object
Google     50,000

industry
subject    object
Google     Internet
Google     Software
Google     Hardware

developer
subject    object
Google     Android

Slide annotations:
•  Schema does not change on updates
•  Schema might change on updates
•  Columns are overloaded
•  Traditional relational column treatment
•  Static mix of overloaded and normal columns

Page 36: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Storing Linked Data: Query Processing
•  Schema agnostic
   –  the algebraic plan obtained for a query involves a large number of self-joins
   –  queries are favorable when the predicate is a variable
•  Hybrid approach and schema-aware
   –  the algebraic plan contains operations over the appropriate property/class tables (more in the spirit of existing relational schemas)
   –  saves many self-joins over triple tables
   –  if the predicate is a variable, then one query per property/class must be expressed
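The self-join behind a schema-agnostic plan can be seen on a two-pattern query such as `?x rdf:type ?t . ?x dbo:artist ?a`: both patterns read the same triple table and are joined on the subject. A deliberately naive nested-loop sketch with a hypothetical function name:

```python
triple_table = [
    ("dbr:Seven_Seas_Of_Rye", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Starman_(song)", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Seven_Seas_Of_Rye", "dbo:artist", "dbo:Queen"),
]


def self_join_on_subject(table, predicate_a, predicate_b):
    """Join the triple table with itself: same subject, two different predicates."""
    return sorted(
        (s1, o1, o2)
        for s1, p1, o1 in table
        if p1 == predicate_a
        for s2, p2, o2 in table  # second scan of the same table: the self-join
        if p2 == predicate_b and s1 == s2
    )


result = self_join_on_subject(triple_table, "rdf:type", "dbo:artist")
print(result)
```

Every additional triple pattern in the query adds another scan of the same table, which is why schema-aware layouts that store both properties in one row can avoid these joins.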

Page 37: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Purpose of an RDF Querying Benchmark
•  Test the performance of RDF stores
   –  Independently of the underlying storage engine
   –  Independently of the underlying logical and physical schema
   –  Independently of the query actually executed in the engine
      •  SPARQL for native stores
      •  SQL (SPARQL translated to SQL) for relational stores

Page 38: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Overview
•  Introducing Benchmarks
•  A short discussion about Linked Data
   –  Resource Description Framework (Data Model)
   –  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
   –  Synthetic
   –  Real
   –  Benchmark Generators
•  Sum up: what did we learn today?

Page 39: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Benchmarking Principles & Choke Points

Page 40: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Why Benchmarks?
•  Performance Evaluation
   –  There is no single recipe on how to do it right
   –  There are many ways to do it wrong
   –  There are a number of best practices but no broadly accepted standard on how to design and develop a benchmark
•  Questions asked:
   –  What data/datasets should we use?
   –  Which workload/queries should we consider?
   –  What to measure and how to measure?

Page 41: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Benchmark Categories
•  Micro-benchmarks
•  Standard benchmarks
•  Real-life applications

Page 42: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Micro Benchmarks
•  Specialized, stand-alone piece of software
•  Isolate one particular functionality of a larger system
•  In databases, a micro benchmark tests a single database operator
   –  Selection, Join (and all types thereof), Projection, Aggregates, Sub-Queries, …

Page 43: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Micro Benchmarks: Advantages
•  Very focused
   –  Test a specific operator of the system
•  Controllable data & workload
   –  Synthetic and real datasets
      •  Different value ranges, value distributions and correlations (mostly applicable to structured data)
   –  Various data sizes to tackle scalability concerns
•  Queries
   –  Workloads of different complexity & size
      •  Complexity: as to the types of query operators and patterns
      •  Size: as to the number of query operators involved
   –  Allow broad parameter range(s)

Useful for detailed, in-depth analysis
Low setup threshold
Easy to run

Page 44: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Micro Benchmarks: Disadvantages
•  Neglect the larger picture since they do not test the whole system
•  Do not consider how the costs of specific operations flow into the cost of the whole system
•  Do not measure the impact of the micro-benchmark on real-life applications
•  Difficult to generalize the results
•  The results of micro-benchmarks cannot be applied in a straightforward manner
•  Micro-benchmarks do not use standardized metrics

Page 45: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Standard Benchmarks
•  Relational, Object Oriented, Object Relational Database Management Systems
   –  Family of TPC benchmarks for relational databases
•  XML, XPath, XQuery
   –  Mbench, XBench, XMach-1, XMark
•  General Computing
   –  SPEC

Page 46: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Standard Benchmarks: Advantages & Disadvantages
•  Advantages
   –  Mimic real-life scenarios (respond to real needs)
      •  E.g., TPC is a business-oriented benchmark
   –  Publicly available
   –  Well defined
   –  Provide scalable datasets and workloads
   –  Metrics are well defined
•  Disadvantages
   –  Outdated (standardization is a lengthy process)
      •  XQuery took around 7 years to become a standard
      •  TPC benchmark definition is still an ongoing process
   –  Very large and complicated to run
   –  Limited dataset variation (target a specific type of data)
   –  Limited workload (focuses on the application in mind)
   –  Systems are often optimized for the benchmark(s)

Page 47: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

Benchmark Development Methodology
•  Management and methodological activities performed by a group of people
   –  Management: organizational protocols to control the process
   –  Methodological: principles, methods and steps for benchmark creation
•  Benchmark Development
   –  Roles and bodies: people/groups involved in the development
   –  Design principles: fundamental rules that direct the development of a benchmark
   –  Development process: series of steps to develop a benchmark based on Choke Points

Choke Points: the set of technical difficulties that force systems to improve their performance

The Example Standard Benchmark: TPC

•  Transaction Processing Performance Council (TPC)
   –  non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry
   –  goal is to «create, manage and maintain a set of fair and comprehensive benchmarks that enable end-users and vendors to objectively evaluate system performance under well defined consistent and comparable workloads» [NPM+12]

Active TPC Benchmarks (2016)

Benchmark   Explanation
TPC-C       Focuses on transactions.
TPC-DI      Focuses on ETL processes.
TPC-DS      Decision support solutions for, but not limited to, Big Data.
TPC-E       On-Line Transaction Processing (OLTP) workload.
TPC-H       Decision support benchmark, ad hoc queries and concurrent data modifications.
TPC-VMS     Virtual Measurement Single System Specification for running and reporting performance metrics for virtualized databases.
TPCx-HS     Measure of hardware, operating system and commercial Apache Hadoop File System API
TPCx-V      Measures the performance of servers running database workloads in virtual machines.

Benchmark Development Process (1)

•  Design Principles [L97]

Principle        Comment
Relevant         The benchmark is meaningful for the target domain
Understandable   The benchmark is easy to understand and use
Good Metrics     The metrics defined by the benchmark are linear, orthogonal and monotonic
Scalable         The benchmark is applicable to a broad spectrum of hardware and software configurations
Coverage         The benchmark workload does not oversimplify the typical environment
Acceptance       The benchmark is recognized as relevant by the majority of vendors and users

Benchmark Development Process (2)

•  Benchmarking Metrics
   –  Performance
   –  Price/Performance
   –  Energy/Performance Metrics: energy metric to measure the energy consumption of system components
•  TPC Pricing specification
   –  Provides consistent methodologies for computing the price of the benchmarked system, licensing of software, maintenance, …

Benchmark   Metrics
TPC-C       Transaction Rate (tpmC), Price per Transaction ($/tpmC)
TPC-E       Transactions per Second (tpsE)
TPC-H       Composite Query per Hour Performance Metric (QphH@Size), Price per Composite Query per Hour Performance Metric ($/QphH@Size)
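
A price/performance metric is a ratio of the priced system configuration to the measured throughput. The following is a minimal sketch of that arithmetic only; the function name and the figures are hypothetical and are not taken from the TPC Pricing specification or from any published result.

```python
def price_performance(total_system_price: float, throughput: float) -> float:
    """Return the price per unit of throughput (for example $ per tpmC)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return total_system_price / throughput


# Hypothetical figures, used only to illustrate the ratio.
ratio = price_performance(total_system_price=500_000.0, throughput=250_000.0)
print(f"{ratio:.2f} $/tpmC")
```

A lower ratio means less money spent per unit of measured performance, which is why the price metric is reported alongside the raw throughput.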

Desirable Attributes of a Benchmark:

•  “A good benchmark is written in a high-level language, making it portable across different machines; is representative of some programming style or application; can be measured easily; has wide distribution” [W90]
•  “a domain specific benchmark must meet four important criteria: relevance, portability, simplicity, scalability” [G93]
•  Six desirable attributes for TPC-C [L97]: relevance, understandability, good metrics, scalability, coverage, acceptance
•  Five desirable attributes in Huppler [H09]: relevance, repeatability, fairness, verifiability, economy
•  Big Data Benchmarking [1]: “a successful benchmark should be simple to implement and execute, cost effective, timely and verifiable”.

Design Principles: Desirable Attributes of a Benchmark

•  Relevant/Representative: based on realistic use case scenarios and must reflect the needs of the use case
•  Understandable/Simple: the results and workload are easily understandable by users
•  Portable/Fair/Repeatable: no system benefits from the benchmark. Must be deterministic and provide a «gold standard»
•  Metrics: should be well defined to be able to assess and compare the systems.
•  Scalable: datasets should be in the order of billions of «objects»
•  Verifiable: allow verifiable results in each execution

[Figure: Benchmark Attributes (relevant, representative, understandable, simple, portable, fair, repeatable, metrics, scalable, verifiable)]

Design of Benchmark Workload [Grey93]

•  Micro-benchmarks: design the queries to test specific features of the query language or to test specific data management approaches
•  Domain-specific and standard benchmarks: base the query mix on specific requirements of real-world use cases
   –  Leads to complex queries that involve many (different) language features

Development Process: Choke Points

•  A benchmark exposes a system to a workload and should identify the technical difficulties of the system under test
•  Choke Points [BNE14] are those technological challenges whose resolution will significantly improve the performance of a product
•  TPC-H: a 20-year-old benchmark (superseded by TPC-DS) but still influential, using business-oriented queries and concurrent modifications
•  22 queries capturing (most of) the aspects of relational query processing
•  [BNE14] performed an analysis of the TPC-H workload and identified 28 choke points grouped into 6 categories

Choke Points à la TPC-H

•  CP1: Aggregation Performance
   –  Ordered aggregation, small group-by keys, interesting orders, dependent group-by keys
•  CP2: Join Performance
   –  Large joins, sparse foreign keys, rich join order optimization, late projection
•  CP3: Data Access Locality (materialized views)
   –  Columnar locality, physical locality by key, detecting correlation
•  CP4: Expression Calculation
   –  Raw expression arithmetic, complex Boolean expressions in joins and selections, string matching performance
•  CP5: Correlated Sub-queries
   –  Flattening sub-queries, moving predicates to a sub-query, overlap between outer- and sub-query
•  CP6: Parallelism and Concurrency
   –  Query plan parallelization, workload management, result re-use

Choke Points à la RDF

Choke Point   Description

CP1: JOIN ORDERING
   1.  Tests if the engine can evaluate the trade-offs between the time spent to find the best execution plan and the quality of the output plan
   2.  Tests the ability of the engine to consider cardinality constraints expressed by the different kinds of schema constraints (e.g., functional and inverse functional properties)

CP2: AGGREGATION
   Aggregations are implemented with the use of sub-selects in the SPARQL query; the optimizer should recognize the operations included in the sub-selects and evaluate them first.

CP3: OPTIONAL & NESTED OPTIONAL CLAUSES
   Tests the ability of the optimizer to produce a plan where the execution of the optional triple patterns is the last to be performed, since optional clauses do not reduce the size of intermediate results.

Choke Points in RDF Benchmarks

Choke Point   Description

CP4: REASONING
   Tests the ability of the engine to handle efficiently RDFS and OWL constructs expressed in the schema

CP5: PARALLEL EXECUTION OF UNIONS
   Tests the ability of the optimizer to produce plans where unions are executed in parallel

CP6: FILTERS
   Tests the ability of the engines to execute filter expressions as early as possible in order to eliminate a possibly large number of intermediate results

CP7: ORDERING
   Tests the ability of the engine to choose query plan(s) that facilitate the ordering of results

CP8: GEO-SPATIAL PREDICATES
   Tests the ability of the system to handle queries for geospatial data
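
The effect targeted by the FILTERS choke point (CP6) can be illustrated outside any RDF engine. The sketch below is a toy illustration, not how a particular SPARQL optimizer works: it joins two hypothetical relations with nested loops and compares the number of intermediate rows when the filter runs after the join with the number when it runs before. All names and data are made up for the example.

```python
def join_then_filter(products, offers, keep):
    """Join every product with its offers, then filter on the product price."""
    joined = [(pid, price, vendor)
              for pid, price in products
              for offer_pid, vendor in offers
              if pid == offer_pid]
    result = [row for row in joined if keep(row[1])]
    return result, len(joined)


def filter_then_join(products, offers, keep):
    """Filter products first, then join only the survivors with the offers."""
    kept = [(pid, price) for pid, price in products if keep(price)]
    joined = [(pid, price, vendor)
              for pid, price in kept
              for offer_pid, vendor in offers
              if pid == offer_pid]
    return joined, len(joined)


# Hypothetical toy data: 100 products, one offer per product.
products = [(pid, float(pid)) for pid in range(100)]
offers = [(pid, f"vendor{pid % 5}") for pid in range(100)]


def cheap(price):
    return price < 10.0


late_result, late_size = join_then_filter(products, offers, cheap)
early_result, early_size = filter_then_join(products, offers, cheap)
print(late_size, early_size)  # intermediate rows produced by each plan
```

Both plans return the same answers because the filter only refers to a product attribute; the early filter simply produces fewer intermediate rows before the join.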

Choke Points in RDF Benchmarks

Choke Point   Description

CP9: FULL TEXT
   Queries that involve the evaluation of regular expressions on data value properties of resources

CP10: DUPLICATE ELIMINATION
   Tests the ability of the system to identify duplicate entries and eliminate them during the creation of intermediate results

CP11: COMPLEX FILTER CONDITIONS
   Tests the ability of the engine to deal with negation, conjunction and disjunction efficiently (i.e., breaking the filters into a conjunction of filters and executing them in parallel).

Query Characteristics

Characteristics: simple filters, complex filters, >=9 TPs (triple patterns), unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX, UNION, DESCRIBE, CONSTRUCT, ASK

•  Filter Expressions
•  Joins
•  Negation
•  Optional
•  Aggregates
•  Ordering
•  Duplicate Elimination
•  Regular Expressions
•  Union
•  Different types of queries

Overview

•  Introducing Benchmarks
•  A short discussion about Linked Data
   –  Resource Description Framework (Data Model)
   –  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
   –  Synthetic
   –  Real
   –  Benchmark Generators
•  Sum up: what did we learn today?

A Survey of RDF Benchmarks

Synthetic Benchmarks
Real Benchmarks
Benchmark Generators

Benchmark Components

•  Datasets
   –  The raw material of the benchmark against which the workload will be evaluated
   –  Synthetic & Real Datasets
      •  Synthetic: produced with a data generator (that hopefully produces data with interesting characteristics)
      •  Real: widely used datasets from a domain of interest
•  Query Workload
   –  Sets of queries and/or updates to evaluate the system with
•  Metrics
   –  The performance metric(s) that determine the system's behavior

Synthetic RDF Benchmarks

Lehigh University Benchmark (LUBM) [GPH05]

•  Benchmark intended to facilitate the evaluation of Semantic Web repositories
•  Widely adopted by the data engineering and Semantic Web communities
•  Focuses on evaluating the performance of query optimizers and not ontology reasoning as in DL systems
•  Components:
   –  Scalable synthetic data generator
   –  Ontology of moderate size and complexity
   –  Supports extensional queries (i.e., queries that request instances and not only schema information)
   –  Proposes performance metrics

LUBM Univ-Bench Ontology

•  Describes universities and departments and related activities
•  Expressed in OWL Lite (took into consideration the limitations of reasoning systems regarding completeness)

Statistics:
   –  43 Classes
   –  32 Object Type Properties
   –  7 Data Type Properties
   –  OWL Lite inverseOf, TransitiveProperty, someValuesFrom, intersectionOf

LUBM Data Generation (1)

•  Synthetically produced extensional data that conform to the LUBM Ontology
•  Data are generated using the UBA (Univ-Bench Artificial Data Generator)
•  Random and repeatable data generation
•  Minimum unit of data generation: University that has departments, employees, courses
•  Instances of classes and properties are randomly produced
•  To make data more realistic, restrictions are applied:
   –  «Minimum 15 and maximum 25 departments per university»
   –  «Undergraduate student/faculty ratio between 8 and 14 inclusive»

LUBM Data Generation (2)

•  Assignment of identifiers is done using zero-based indexes
   –  University0, Department0, …
•  Data generated by the tool are repeatable for the universities
   –  User enters a seed for the random number generator employed in the data generation process
•  Data created are represented in OWL Lite
•  Configurable serialization and representation model (RDF/XML in .owl files, DAML)
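
The combination of random, repeatable and constrained generation can be sketched as follows. This is an illustration under stated assumptions, not the UBA generator itself: the function name, the way the seed and the university index are combined, and the faculty range per department are hypothetical; only the department bounds (15 to 25), the student/faculty ratio bounds (8 to 14) and the zero-based identifiers come from the restrictions quoted on the slides.

```python
import random


def generate_university(index: int, seed: int) -> dict:
    """Generate one university deterministically from (seed, index)."""
    # Assumption: derive a per-university generator so that each
    # university can be regenerated independently of the others.
    rng = random.Random(seed * 1_000_003 + index)
    departments = []
    for dept_index in range(rng.randint(15, 25)):  # 15..25 departments
        faculty = rng.randint(30, 42)              # hypothetical range
        ratio = rng.randint(8, 14)                 # undergraduates per faculty member
        departments.append({
            "name": f"Department{dept_index}",     # zero-based identifiers
            "faculty": faculty,
            "undergraduates": faculty * ratio,
        })
    return {"name": f"University{index}", "departments": departments}


first = generate_university(0, seed=42)
again = generate_university(0, seed=42)
print(first["name"], len(first["departments"]), first == again)
```

Running the generator twice with the same seed yields identical data, which is what makes results obtained by different people on "the same" dataset comparable.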

LUBM Queries (1)

•  14 realistic queries
•  Written in SPARQL 1.0
•  Query design criteria
   –  Input size:
      •  proportion of the class instances involved and entailed in the query to the total instances in the dataset
   –  Selectivity:
      •  estimated proportion of the class instances that satisfy the query criteria
      •  depends on the input dataset size

LUBM Queries (2)

   –  Complexity:
      •  measured on the basis of the number of classes and properties involved in the query
      •  different complexity for the same query and for different implementations: relational vs RDF
   –  Hierarchy information:
      •  class and property hierarchies are used to obtain all query answers
   –  Logical inference:
      •  inference is required to obtain all query answers

LUBM Queries (3): Characteristics

Characteristic       Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14
Simple filters
Complex filters
>=9 TPs
Unbound predicates
Negation
OPTIONAL
LIMIT
ORDER BY
DISTINCT
REGEX
UNION
DESCRIBE
CONSTRUCT
ASK

Simple SPARQL SELECT queries with a focus on challenging engines regarding reasoning

LUBM Queries (4): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1
Q2 ✓
Q3 ✓
Q4 ✓ ✓
Q5 ✓
Q6 ✓
Q7 ✓
Q8 ✓
Q9 ✓
Q10 ✓
Q11 ✓
Q12 ✓ ✓
Q13 ✓
Q14

Join Ordering: most complex query contains 5 joins
Reasoning: focus on subClass and subProperty hierarchies

LUBM Performance Metrics (1)

•  Load Time:
   –  Time needed to parse and load a dataset and to reason over it
   –  Focuses on persistent stores
•  Repository Size:
   –  For persistent storage only
   –  The size of all files that constitute the repository
•  Query Response Time:
   –  Average time for executing a query 10 times (warm run)
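
The query response time metric can be approximated with a small timing harness. The sketch below assumes a hypothetical `run_query` callable that executes one query against the system under test; it is not part of the LUBM test module, and it simply averages the wall-clock time of ten consecutive executions so that later runs benefit from warmed caches.

```python
import time


def average_response_time(run_query, repetitions: int = 10) -> float:
    """Average wall-clock time (seconds) of `repetitions` consecutive executions."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # hypothetical callable that sends the query and reads all results
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Stand-in workload used only to make the example self-contained.
average = average_response_time(lambda: sum(range(10_000)))
print(f"average response time: {average:.6f} s")
```

When timing a real store, the callable should consume the complete result set, since some engines stream answers lazily.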

LUBM Performance Metrics (2)

•  Query Completeness and Soundness:
   –  Measures the degree of completeness of a query answer as the percentage of entailed unique answers
•  Combined Metric:
   –  Combines query response time with answer completeness and answer soundness
   –  Measures the trade-off between query response time and completeness of results
      •  See how reasoning affects query performance
   –  Provides an absolute ranking of systems
   –  But hides details!
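
Completeness and soundness can be computed as set ratios once the entailed answers of a query are known. The sketch below treats completeness as the share of entailed answers that the system returned, as described on the slide; the soundness definition (share of returned answers that are entailed) is an assumption based on the usual meaning of the term, and the sample answers are hypothetical.

```python
def completeness(returned: set, entailed: set) -> float:
    """Fraction of the entailed (expected) answers that were returned."""
    if not entailed:
        return 1.0
    return len(returned & entailed) / len(entailed)


def soundness(returned: set, entailed: set) -> float:
    """Fraction of the returned answers that are actually entailed (assumed definition)."""
    if not returned:
        return 1.0
    return len(returned & entailed) / len(returned)


# Hypothetical answers: the system misses two entailed answers and returns one wrong one.
entailed = {"Student1", "Student2", "Student3", "Student4"}
returned = {"Student1", "Student2", "Course9"}
print(completeness(returned, entailed), soundness(returned, entailed))
```

Using sets means duplicates in the returned answers are ignored, which matches the slide's reference to unique answers.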

SP2Bench [SHM+09]

•  Proposes a language-specific benchmark to test the most common SPARQL constructs, operator constellations and RDF access patterns
•  Components:
   –  Scalable synthetic data generator
      •  Creation of DBLP documents in RDF mimicking key characteristics of the original DBLP dataset
      •  Produced datasets contain blank nodes and RDF containers
   –  Supports extensional queries (i.e., queries that request instances and not schema information)
   –  Proposes performance metrics

SP2Bench Schema DBLP (1)

•  Study of DBLP real data
   –  Determine the probability distribution for selected attributes per document class, which forms the basis for generating class instances
   –  Reveals that only a few of the attributes are repeated for the same class

Extract of the DBLP DTD (2008):

<!ELEMENT dblp (article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www)*>
<!ENTITY % field "author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|publisher|note|crossref|isbn|series|school|chapter">
<!ELEMENT article (%field;)*>
<!ELEMENT inproceedings (%field;)*>

SP2Bench Schema DBLP (2)

•  Probability distribution for selected attributes per document class
•  Use Gaussian curves to approximate input data
   –  Typically used to model normal distributions
•  Studied the number of class instances over time and modeled those with a power law distribution

          Article   Inproc.   Proc.    Book     WWW
author    0.9895    0.9970    0.0001   0.8937   0.9973
cite      0.0048    0.0104    0.0001   0.0079   0.0000
editor    0.0000    0.0000    0.7992   0.1040   0.0004
isbn      0.0000    0.0000    0.8592   0.9294   0.0000
…         …         …         …        …        …
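
A generator can use such a probability table by drawing, for each new document, one random number per attribute. The sketch below uses only the four rows listed above and is an illustration, not the SP2Bench generator: the dictionary layout, the function name and the independent per-attribute draw are assumptions.

```python
import random

# Probability that an attribute occurs in a document of a given class
# (values copied from the four rows listed on the slide).
ATTRIBUTE_PROBABILITY = {
    "author": {"article": 0.9895, "inproceedings": 0.9970, "proceedings": 0.0001, "book": 0.8937, "www": 0.9973},
    "cite":   {"article": 0.0048, "inproceedings": 0.0104, "proceedings": 0.0001, "book": 0.0079, "www": 0.0000},
    "editor": {"article": 0.0000, "inproceedings": 0.0000, "proceedings": 0.7992, "book": 0.1040, "www": 0.0004},
    "isbn":   {"article": 0.0000, "inproceedings": 0.0000, "proceedings": 0.8592, "book": 0.9294, "www": 0.0000},
}


def sample_attributes(doc_class: str, rng: random.Random) -> list:
    """Pick the attributes of one document; each attribute is drawn independently."""
    return [attribute
            for attribute, per_class in ATTRIBUTE_PROBABILITY.items()
            if rng.random() < per_class[doc_class]]


rng = random.Random(2008)  # fixed seed keeps the generation deterministic
for doc_class in ("article", "proceedings", "book"):
    print(doc_class, sample_attributes(doc_class, rng))
```

Because `rng.random()` returns values in [0, 1), an attribute with probability 0.0000 is never generated, so an article produced by this sketch never carries an editor or an ISBN.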

SP2Bench Data Generation

•  Synthetically produced extensional data that conform to the DBLP Schema
•  Use of existing external vocabularies to describe resources in a uniform way
   –  FOAF – Friend of a Friend (persons) [FOAF], SWRC – Semantic Web for Research Communities (scientific publications) [SWRC], DC – Dublin Core [DC]
•  Introduce blank nodes and RDF containers (rdf:Bag) to capture all aspects of the RDF data model
•  Data generation takes into account data approximation as reflected in the Gaussian curves
•  Data generator takes as input either the triple count or the year up to which the data is generated
   –  Always ending up in a consistent state!
•  Random functions are based on a fixed seed, making data generation deterministic

SP2Bench Queries (1): Characteristics

•  12 queries (17 when the variants Q3a-c, Q5a-b and Q12a-c are counted separately)
•  Provided in natural language and in SPARQL 1.0; SQL translations are also available
•  Query design criteria
   –  Focus on SELECT and ASK SPARQL forms
   –  Aim at covering the majority of SPARQL constructs (including DISTINCT, ORDER BY, LIMIT, OFFSET)

SP2Bench Queries (2): Characteristics

Characteristic Q1 Q2 Q3abc Q4 Q5ab Q6 Q7 Q8 Q9 Q10 Q11 Q12abc
Simple filters ✔ ✔ ✔ ✔
Complex filters ✔ ✔ ✔
>=9 TPs ✔ ✔ ✔ ✔ ✔
Unbound predicates ✔ ✔
Negation ✔ ✔
OPTIONAL ✔ ✔ ✔
LIMIT ✔
ORDER BY ✔ ✔
DISTINCT ✔ ✔ ✔ ✔ ✔ ✔
REGEX
UNION ✔ ✔ ✔
DESCRIBE
CONSTRUCT
ASK ✔

SP2Bench Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✓
Q2 ✓ ✓
Q3 ✓
Q4 ✓ ✓ ✓
Q5 ✓ ✓ ✓
Q6 ✓ ✓ ✓ ✓
Q7 ✓ ✓ ✓ ✓
Q8 ✓ ✓ ✓ ✓ ✓
Q9 ✓ ✓
Q10
Q11 ✓
Q12 ✓ ✓ ✓

Join Ordering: most complex query contains 8 joins
Filters: most complex query contains 2 filters
Duplicate Elimination

SP2Bench Performance Metrics

•  Loading Time:
   –  Time needed by the tested system to parse and load a dataset and to reason over it
   –  Focuses on persistent stores
•  «Per-query» performance:
   –  Performance of each query
•  «Global» performance:
   –  List the arithmetic and geometric mean of queries
      1.  Multiply the execution times of all 17 queries
      2.  Penalize queries that fail with a 3600 s penalty
      3.  Compute the 17th root of the result
•  Memory consumption
   –  High watermark of main memory consumption
   –  Average memory consumption of all queries
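
The «global» performance steps amount to a geometric mean in which failed queries count as 3600 s. The following is a minimal sketch of that computation; representing a failed query as `None` and summing logarithms instead of multiplying the raw times (to avoid overflow for long runs) are choices made for this example, not something prescribed by the benchmark.

```python
import math

PENALTY_SECONDS = 3600.0  # time charged to a query that fails


def global_performance(times):
    """Return (arithmetic mean, geometric mean) of per-query times in seconds.

    `times` holds one entry per query; None marks a failed query.
    All successful times must be strictly positive.
    """
    penalized = [PENALTY_SECONDS if t is None else t for t in times]
    arithmetic = sum(penalized) / len(penalized)
    # The n-th root of the product equals exp(mean of the logarithms).
    geometric = math.exp(math.fsum(math.log(t) for t in penalized) / len(penalized))
    return arithmetic, geometric


# Hypothetical measurements for three queries, one of which failed.
arithmetic, geometric = global_performance([0.5, 12.0, None])
print(f"arithmetic mean: {arithmetic:.2f} s, geometric mean: {geometric:.2f} s")
```

The geometric mean dampens the influence of a single penalized query, whereas the arithmetic mean is dominated by it; reporting both shows whether a system is uniformly slow or fails on a few queries.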

Berlin SPARQL Benchmark (BSBM) [BS09] [BSBM]

•  Built around an e-commerce use case
•  Query mix emulates the search and navigation patterns of a user looking for a product of interest
•  Goals
   –  Allow the comparison of SPARQL engines across different architectures (relational and/or RDF)
   –  Challenge forward and backward chaining reasoning engines
   –  Focuses on an enterprise setting where multiple clients concurrently execute workloads
   –  Measures SPARQL query performance and not (so much) reasoning
•  Components
   –  Data generator: supports the creation of arbitrarily large datasets
   –  Test driver: executes sequences of SPARQL queries

BSBM Schema (1)

•  E-commerce use case: products are offered by several vendors and consumers post reviews for those products

[Figure: BSBM schema diagram. Classes and their properties, with cardinalities where stated:]

Product: rdfs:label, rdfs:comment, rdf:type, bsbm:producer, bsbm:productFeature [9..22], bsbm:productPropertyTextual1, bsbm:productPropertyTextual2, bsbm:productPropertyTextual3, bsbm:productPropertyTextual4 [0..1], bsbm:productPropertyTextual5 [0..1], bsbm:productPropertyNumeric1, bsbm:productPropertyNumeric2, bsbm:productPropertyNumeric3, bsbm:productPropertyNumeric4 [0..1], bsbm:productPropertyNumeric5 [0..1]
ProductType: rdfs:label, rdfs:comment, rdf:type, rdfs:subClassOf [0..1]
ProductFeature: rdfs:label, rdfs:comment, rdf:type
Producer: rdfs:label, rdfs:comment, rdf:type, foaf:homepage, bsbm:country
Vendor: rdfs:label, rdfs:comment, rdf:type, foaf:homepage, bsbm:country
Offer: bsbm:product, bsbm:vendor, bsbm:price, bsbm:validFrom, bsbm:validTo, bsbm:deliveryDays, bsbm:offerWebpage
Person: foaf:name, foaf:mbox_sha1sum, bsbm:country
Review: bsbm:reviewFor, rev:reviewer, bsbm:reviewDate, dc:title, rev:text, bsbm:rating1 [0..1], bsbm:rating2 [0..1], bsbm:rating3 [0..1], bsbm:rating4 [0..1]

BSBM Schema & Data Characteristics (1)

•  Every product has a type from a product hierarchy
•  Product hierarchy is not fixed (depends on the dataset size)
   –  Its depth and width depend on the chosen scale factor $n$
   –  Hierarchy depth: $d = 1 + \mathrm{round}(\log_{10}(n)) / 2$
   –  Branching factor for
      •  root level: $bf_r = 1 + \mathrm{round}(\log_{10}(n))$
      •  all other levels is 8
•  Product types are assigned a variable number of product features
   –  computed as lowerBound and upperBound with
      •  $\mathit{lowerBound} = 35 \cdot i / (d \cdot (d + 1) / 2 - 1)$, $\mathit{upperBound} = 75 \cdot i / (d \cdot (d + 1) / 2 - 1)$
   –  Set of possible features for a given product type is the union of the features of the type and of all its “super-types”.

BSBM Schema & Data Characteristics (2)

•  Products, Vendors, Offers
   –  Products that share the same type also have the same set of features
   –  For a given product, its features are chosen from the set of possible features with a hard-coded probability of 25%
   –  Normal distribution with a mean of μ = 50 and standard deviation σ = 16.6 is employed to associate products with producers
   –  Vendors are associated to countries following hard-coded distributions
   –  Number of offers is n*20; offers are distributed over products following a normal distribution with «fixed parameters» μ = n/2 and σ = n/4
   –  Offers are distributed over vendors following a normal distribution with «fixed parameters» μ = 2000 and σ = 667

BSBM Schema & Data Characteristics (3)

•  Reviews
   –  10 times the scale factor n
   –  Data type property values (title and text) between 50–300 words
   –  Up to 4 ratings, each rating is a random integer between 1 and 10
   –  Each rating is missing with hard-coded probability 10%
   –  Distributed over products with a normal distribution depending on dataset size and following μ = n/2 and σ = n/4
   –  Number of reviews per reviewer follows a normal distribution with μ = 20 and σ = 6.6
   –  Reviewers are generated until all reviews are assigned a reviewer
   –  Reviewer countries follow the same distribution as vendor countries
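
The fixed multipliers can be checked with a few lines of arithmetic. The sketch below assumes that the scale factor n is the number of products, which agrees with the instance counts listed for BSBM data generation (for example 666 products, 13,320 offers and 6,660 reviews); the function name is hypothetical and the sketch does not reproduce the BSBM generator.

```python
def bsbm_instance_counts(n_products: int) -> dict:
    """Derive the number of offers and reviews from the scale factor n.

    Assumption: n is the number of products in the generated dataset.
    """
    return {
        "offers": 20 * n_products,   # "number of offers is n*20"
        "reviews": 10 * n_products,  # "10 times the scale factor n"
    }


for n in (666, 2_785, 70_812, 284_826):
    counts = bsbm_instance_counts(n)
    print(f"{n:>7} products -> {counts['offers']:>9} offers, {counts['reviews']:>9} reviews")
```

The remaining counts (producers, vendors, reviewers) depend on random draws from the normal distributions listed above, so they cannot be derived by a fixed multiplier.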

BSBM Data Generation (1)

•  Synthetically produces instances of class Product that conform to the BSBM Schema

Indicative number of instances for different dataset sizes

Total # triples       250K     1M       25M         100M
# products            666      2,785    70,812      284,826
# product features    2,860    4,745    23,833      47,884
# product types       55       151      731         2,011
# producers           14       60       1,422       5,618
# vendors             8        34       722         2,854
# offers              13,320   55,700   1,416,240   5,696,520
# reviewers           339      1,432    36,249      146,054
# reviews             6,660    27,850   708,120     2,848,260
Total # instances     23,922   92,757   2,258,129   9,034,027

BSBM Queries (1)

•  12 queries
•  Query mix emulates the search and navigation patterns of a customer looking for a product
•  BSBM queries are given in natural language, SPARQL and SQL

Query   Description
Q1      Find products for a given set of generic features
Q2      Retrieve basic information about a specific product for display purposes
Q3      Find products having some specific features and not having one feature
Q4      Find products matching two different sets of features
Q5      Find products that are similar to a given product
Q6      Find products having a label that contains a specific string
Q7      Retrieve in-depth information about a product including offers and reviews
Q8      Give me recent language reviews for a specific product
Q9      Get information about a reviewer
Q10     Get cheap offers which fulfill the consumer’s delivery requirements
Q11     Get all information about an offer
Q12     Export information about an offer into another schema

BSBM Queries (2): Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12
Simple filters ✔ ✔ ✔ ✔ ✔ ✔ ✔
Complex filters ✔ ✔
>9 TPs ✔ ✔ ✔ ✔ ✔
Unbound predicates
Negation ✔
OPTIONAL ✔ ✔ ✔ ✔
LIMIT ✔ ✔ ✔ ✔ ✔ ✔
ORDER BY ✔ ✔ ✔ ✔ ✔ ✔
DISTINCT ✔ ✔ ✔
REGEX ✔
UNION ✔ ✔
DESCRIBE ✔
CONSTRUCT ✔
ASK

Callouts on individual queries:
   –  11 JOINs, 3 OPTIONAL clauses, 3 filters, 1 unbound variable
   –  4 OPTIONAL clauses

BSBM Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔ ✔ ✔ ✔
Q2 ✔
Q3 ✔ ✔ ✔
Q4 ✔ ✔ ✔ ✔
Q5 ✔ ✔ ✔ ✔
Q6 ✔ ✔
Q7 ✔ ✔ ✔
Q8 ✔ ✔ ✔
Q9 ✔
Q10 ✔ ✔ ✔ ✔
Q11 ✔
Q12 ✔

Join Ordering: most complex query contains 11 joins
Filters: most complex query contains 3 filters and most complex filter contains arithmetic expressions
Result Ordering

Page 91: Assessing the performance of RDF Engines: Discussing RDF ... · – datasets (synthetic or real) – set of software tools • synthetic data generators • query generators – performance

BSBM: Performance Metrics
• Query Mixes per Hour (QMpH)
  – Measures the number of complete BSBM query mixes answered by a system under test, for a specific number of clients running concurrently against the system under test
• Queries per Second (QpS)
  – Measures the number of queries of a specific type handled by the system under test in a second
  – Calculated by dividing the number of queries of a specific type within a benchmark run by the total execution time of those queries
    • Must take into consideration the running time of all queries and not of queries of the specific type
• Load Time
  – Time to load the dataset in the RDF or relational repositories
    • Includes the time to create the appropriate data structures & indices
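The two throughput metrics above can be sketched as follows. This is a minimal illustration, not the BSBM test driver: the `QueryRun` record and both function names are hypothetical, and timings are assumed to be collected in seconds.

```python
from dataclasses import dataclass


@dataclass
class QueryRun:
    """One executed query (hypothetical record, not a BSBM type)."""
    query_type: str
    seconds: float


def queries_per_second(runs, query_type):
    """QpS: number of queries of one type divided by their total execution time."""
    times = [r.seconds for r in runs if r.query_type == query_type]
    total = sum(times)
    return len(times) / total if total > 0 else 0.0


def query_mixes_per_hour(completed_mixes, wall_clock_seconds):
    """QMpH: completed query mixes, scaled to one hour of wall-clock time."""
    return completed_mixes * 3600.0 / wall_clock_seconds


runs = [QueryRun("Q1", 0.5), QueryRun("Q1", 1.5), QueryRun("Q2", 2.0)]
print(queries_per_second(runs, "Q1"))
print(query_mixes_per_hour(10, 1800))
```

Note that `queries_per_second` only looks at the execution time of the selected query type, which is exactly the limitation the slide points out.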


Semantic Publishing Benchmark (SPB)
• Developed in the context of the FP7 EU project LDBC (2012-2015)
• LDBC’s goals:
  – Develop querying benchmarks that will spur research & industry progress in large-scale graph and RDF data management
    • scalability, storage, indexing and query optimization techniques for RDF and graph database solutions
    • quantitatively and qualitatively assess different solutions for RDF data integration
  – Establish an industry-neutral entity, the LDBC foundation, à la the Transaction Processing Performance Council (TPC)

Semantic Publishing Benchmark (SPB)
• Industry-motivated benchmark
  – The scenario involves a media/publisher organization that maintains semantic metadata about its journalistic assets
• Components
  – Scalable synthetic data generator
    • Creation of instances of BBC ontologies mimicking characteristics of the original real input datasets
  – Supports extensional queries (i.e., queries that request instances and not schema information)
  – Workload simulates consumption of RDF metadata
    • Concurrent read and update queries
  – Proposes performance metrics

SPB Design: Requirements
• Storing and processing RDF data
  – Storing and isolating data in separate RDF graphs
  – Supporting the following SPARQL standards:
    • SPARQL 1.1 Protocol, Query, Update
• Support for schema languages
  – Support for RDFS to obtain the correct answers
  – Optional support for the RL profile of the Web Ontology Language (OWL 2 RL) in order to pass the conformance test suite
• Loading data from RDF serialization formats
  – N-Quads, TriG, Turtle, etc.

SPB Schema: BBC Ontologies (1)
• Core Ontologies: 7 ontologies describe basic concepts about entities and relationships in the domain of interest
  – Basic concepts: Creative Works, Places, Persons, Provenance Information, Company Information, etc.

[Figure: excerpt of the core ontologies centered on CreativeWork. Classes shown: Thing, CreativeWork, NewsItem, BlogPost, Programme, Theme, Organisation, Event, Place, Person, Audience (International Audience, National Audience), cwork:Format (Textual, Video, Interactive, Image, Audio, PictureGallery), cwork:Thumbnail and ThumbnailType (StandardThumbnail, CloseUpThumbnail, FixedSize66Thumbnail, FixedSize266Thumbnail, FixedSize466Thumbnail). Properties shown: cwork:title, cwork:shortTitle, cwork:description, cwork:category, cwork:audience, cwork:primaryFormat, cwork:dateCreated, cwork:dateModified, cwork:thumbnail, cwork:altText, thumbnailType, cwork:tag, tag, about, mentions, owl:sameAs. Legend: rdfs:subClassOf, rdfs:subPropertyOf, rdf:type.]

SPB Schema: BBC Ontologies (2)
• Domain Ontologies: 3 ontologies describe concepts and properties related to a specific domain
  – sports (competitions, events)
  – politics entities
  – news (concepts that journalists tag annotations with)
• Statistics
  – 74 classes
  – 88 datatype properties, 28 object type properties
  – 60 rdfs:subClassOf (maximum depth 3), 17 rdfs:subPropertyOf (maximum depth 1) hierarchies
  – 105 rdfs:domain and 115 rdfs:range RDFS properties
  – 8 owl:oneOf class axioms, 1 owl:TransitiveProperty property

SPB: Reference Datasets
• Collections of entities describing various domains
  – Snapshots of the real datasets of BBC
    • Football competitions and teams
    • Formula One competitions and teams
    • UK Parliament members
  – Additional datasets
    • GeoNames: places, names and coordinates
    • DBPedia: person data
  – Reference dataset size: 25M triples

SPB Data Generation (1): Process
1. Loader
  – Ontology & Reference Data
2. Data Generator
  a. Retrieves instances from Reference Datasets
  b. Generates Creative Works according to pre-defined allocations and models
  c. Writes generated data to disk

[Figure: SPB Data Generator architecture. Components: BBC Ontologies, Reference Datasets, Ontology & Reference DataSet Loader, RDF Repository with SPARQL Endpoint, Creative Works Generator, data generation parameters, generated CWs. Arrows are labeled with steps (1), (2.a), (2.c) and (2.d).]

SPB Data Generation (2)
• Produces synthetic data that mimic most of the characteristics of real-world data provided by BBC
• Input: Core & Domain Ontologies and Reference datasets
• Output:
  – Instances that conform to BBC core ontologies (class CreativeWork)
  – Instances refer to entities in the reference datasets using the about & mentions schema properties
  – Follows the (user) pre-defined distributions of SPB’s Data Generator

[Figure: tagged entities over time (01/2012 to 12/2012), illustrating clustering, correlations and random distribution.]

SPB Queries (1)
• Base and Advanced Workloads
  – Base Workload: 12 queries & update operations
  – Advanced Workload: 24 queries
• Workloads based on real queries used by BBC journalists during their editorial operations
• Editorial agents: simulate editorial work performed by journalists
  – Insert, Update, Delete
• Aggregation agents: simulate retrieval operations performed by end-users

SPB Operational Phases
• Data Loading
  1. Initial loading of reference datasets
    • BBC datasets enriched with DBPedia Person data and GeoNames place data
  2. Generation of Creative Works
    • Parallel generation (multi-threaded and multi-process)
  3. Loading of Creative Works in the RDF repository
• Running the Benchmark
  1. Warm-up phase
  2. Run the benchmark using the Test Driver
  3. Run conformance tests (OWL 2 RL) [optional]

Benchmark Configuration
• Data Generator
  – Allocation of tags in Creative Works
    • Correlations of Creative Works with important entities (persons, places, events)
    • Clustering of Creative Works around major/minor events
  – Size of generated data (triples)
  – Parallel data generation
• Test Driver
  – Distribution of queries in the query mix
    • editorial operations (deletion/addition of RDF triples)
    • aggregate operations (complex SPARQL queries)
  – Number of editorial/aggregation agents
  – Duration of Warm-up and Benchmark phases
  – Each operational phase can be enabled or disabled

SPB Base Workload Queries (2)

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12
Simple filters ✔
Complex filters ✔ ✔ ✔
> 9 TPs ✔ ✔ ✔ ✔
Unbound predicates
Negation
OPTIONAL ✔ ✔ ✔ ✔
LIMIT ✔ ✔ ✔ ✔ ✔ ✔
ORDER BY ✔ ✔ ✔ ✔ ✔ ✔ ✔
DISTINCT ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
COUNT ✔
REGEX
UNION ✔ ✔ ✔
GROUP BY ✔
CONSTRUCT ✔ ✔ ✔ ✔ ✔

Callout: Evaluate (parts of the) query on graphs

SPB Base Workload Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔ ✔ ✔ ✔ ✔
Q2 ✔ ✔ ✔
Q3 ✔ ✔ ✔ ✔ ✔ ✔
Q4 ✔ ✔ ✔ ✔ ✔
Q5 ✔ ✔ ✔ ✔ ✔
Q6 ✔ ✔ ✔ ✔
Q7 ✔ ✔
Q8 ✔ ✔ ✔
Q9 ✔ ✔ ✔
Q10 ✔ ✔ ✔ ✔
Q11 ✔ ✔ ✔ ✔ ✔
Q12 ✔ ✔

Reasoning regarding class & property hierarchies
Join Ordering
Ordering & Duplicate Elimination

SPB Performance Metrics
• SPB Primary Metrics
  – Query Rate Interactive Mix (queries per second)
  – Query Rate Analytical Mix (queries per second)
  – Update Rate (operations per second)
• Query Execution Report (1)
  – Duration of Bulk Load (in ms)
  – Duration of Measurement Window (in minutes)
  – # Complete Analytical Mixes (per second)
  – # Complete Interactive Mixes (per second)
  – # Complete Update Operations
• Query Execution Report (2)
  – Per query: Arithmetic Mean Execution Time, Minimum Execution Time, 90th % Average Execution Time, # Executions

Real RDF Benchmarks

UniProt [RU09][UniprotKB]
• Comprehensive, high-quality and freely accessible resource of protein sequence and functional information
• UniProt Schema
  – UniProt Core Vocabulary, BIBO (journals), ECO (evidence codes), Dublin Core (metadata)
  – UniProt Core Vocabulary: 124 classes, 113 properties
• Dataset contains approximately
  – 13 billion triples
  – 2.5 billion distinct subjects
  – 2 billion distinct objects
• Queries
  – No representative set of queries is offered.
  – [NW09] offers a set of 8 queries to test the RDF-3X engine

UniProt Queries (1) [NW09]: Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Simple filters
Complex filters
> 9 TPs ✔ ✔ ✔ ✔ ✔ ✔
Unbound predicates
Negation
OPTIONAL
LIMIT
ORDER BY
DISTINCT
REGEX
UNION
DESCRIBE
CONSTRUCT
ASK

Join Ordering: RDF-3X aims at optimizing join processing for RDF data

UniProt Queries (2) [NW09]: Choke Points
• Focuses on discovering optimal or close to optimal join orders following a dynamic programming algorithm

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔
Q2 ✔
Q3 ✔
Q4 ✔
Q5 ✔
Q6 ✔
Q7 ✔
Q8 ✔

Join Ordering: the most complex query contains 12 joins; 7 queries contain more than 7 joins

YAGO (Yet Another Great Ontology) [SKW07]
• High-quality multilingual knowledge base derived from Wikipedia, WordNet and GeoNames
• Schema
  – Wikipedia entities, WordNet and GeoNames concepts and relationships: associates the WordNet taxonomy with the Wikipedia category system
  – 10 million schema entities
• Dataset
  – 120 million triples about schema entities
  – 2.625 million links to DBPedia
• Queries
  – No representative set of queries is offered by YAGO
  – [NW10] provides a representative set of 8 queries for the RDF-3X evaluation

YAGO Queries (1) [NW10]: Characteristics
• Simple SELECT queries that focus on join ordering, negation and duplicate elimination

Characteristic A1 A2 A3 B1 B2 B3 C1 C2
Simple filters ✔
Complex filters
> 9 TPs ✔
Unbound predicates
Negation ✔ ✔ ✔
OPTIONAL
LIMIT
ORDER BY
DISTINCT ✔ ✔ ✔ ✔ ✔
REGEX
UNION ✔

YAGO Queries (2) [NW10]: Choke Points
• Queries focus mostly on discovering optimal or close to optimal query evaluation plans, including negation in filters and duplicate elimination.

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
A1 ✔
A2 ✔
A3 ✔ ✔ ✔ ✔
B1 ✔ ✔ ✔
B2 ✔ ✔
B3 ✔ ✔ ✔
C1 ✔ ✔ ✔ ✔
C2 ✔ ✔

Join Ordering: the most complex query contains 8 joins; all queries contain more than 5 joins

Barton Library [Barton]
• Data from the MIT Simile Project, which develops tools for library data management and interoperability
  – contains records that compose an RDF-formatted dump of the MIT Libraries Barton catalog
  – converted from raw data stored in an old library format standard called MARC (Machine-Readable Cataloging)
• Schema
  – Common types include Record and Item, the latter being associated with instances of type Person and with instances of Description
  – Primitive types include Title and Date
• Dataset
  – Approximately 45 million RDF triples
• Queries
  – No representative queries provided with the Barton Library dataset
  – [Abadi07] provides a workload of 7 queries ([NW10] in SPARQL)

Barton Queries (1) [NW10]: Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7
Simple filters ✔ ✔ ✔ ✔
Complex filters
> 9 TPs
Unbound predicates
Negation ✔
OPTIONAL
LIMIT
ORDER BY
DISTINCT ✔ ✔ ✔
REGEX
UNION ✔

Barton Queries (2) [NW10]: Choke Points
• Queries focus mostly on discovering optimal or close to optimal query evaluation plans, including negation in filters and duplicate elimination.

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔
Q2 ✔ ✔
Q3 ✔
Q4 ✔
Q5 ✔ ✔
Q6 ✔ ✔ ✔ ✔
Q7 ✔

Join Ordering: the most complex query contains 3 joins

Linked Sensor Dataset [PHS10]
• Expressive descriptions of approximately 20,000 weather stations in the US
• Divided up into multiple subsets that reflect weather data for specific hurricanes or blizzards from the past (focus on hurricane Ike)
• Schema
  – Contains information about temperature, precipitation, pressure, wind, speed, humidity
  – Contains links to GeoNames and links to observations provided by MesoWest (meteorological service in the US)
• Dataset
  – more than 1 billion triples
• Queries
  – No representative set of queries is offered.

WordNet [WordNet]
• Large lexical database of English, developed under the direction of George A. Miller (Emeritus).
• Schema
  – Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
  – Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser.
• Dataset
  – Approximately 1.9 million triples (300 MB).
• Queries
  – No representative query workload

Publishing TPC-H as RDF [TPC-H]
• A suite of business-oriented ad-hoc queries and concurrent data modifications
• Queries and the data populating the database have been chosen to have broad industry-wide relevance
• The benchmark illustrates decision support systems that examine
  – large volumes of data, execute queries with a high degree of complexity, and provide answers to critical business questions
• Use the DBGEN TPC-H generator to generate a TPC-H relational dataset
• Use the D2R tool or other relational-to-RDF tools to convert the relational dataset to the equivalent RDF one.

Benchmark Generators

DBPedia SPARQL Benchmark (DBSB) [MLA+14]
• Generic methodology for SPARQL benchmark creation
• Based on
  – Flexible data generation that mimics an input data source
  – Query-log mining
  – Clustering of queries
  – SPARQL queries feature analysis
• Methodology is schema agnostic
  – Demonstrated using the DBPedia KB
• Proposed approach applied on various sizes of the DBPedia Knowledge Base
• Benchmark proposes a query workload based on real queries expressed against DBPedia

DBSB Data Generation (1)
• Working assumptions
  1. Output dataset should have similar characteristics as the input dataset
    • Number of classes, properties, value distributions, taxonomic structures (hierarchies)
  2. Varying output dataset sizes
  3. Characteristics such as in- and out-degree of nodes in datasets of varying sizes should be similar
  4. Easily repeatable data generation process

DBSB Data Generation (2)
• Idea
  1. Large datasets produced by
    • Duplicating all triples and changing their namespace
  2. Smaller datasets produced by
    • Removing triples in a way that would preserve the properties of the original graph
    • Using a seed-based method based on the assumption that a representative set of resources is obtained by sampling across classes
      1. For each selected element in the dataset, its concise bound description (CBD) is retrieved and added to the queue
      2. Process is repeated until the target number of triples is reached
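The seed-based reduction can be sketched roughly as below. This is an illustration under stated assumptions, not the DBSB implementation: triples are plain `(s, p, o)` string tuples, the concise bound description is approximated by the outgoing triples of a resource (blank-node closure is ignored), and the names `sample_dataset` and `outgoing_description` are hypothetical.

```python
import random
from collections import defaultdict, deque

RDF_TYPE = "rdf:type"  # assumed predicate name for class membership


def outgoing_description(by_subject, resource):
    """Approximate CBD: every triple whose subject is `resource`."""
    return by_subject.get(resource, [])


def sample_dataset(triples, target_size, seeds_per_class=1, rng=None):
    """Grow a sample from per-class seeds until `target_size` triples are collected."""
    rng = rng or random.Random(0)
    by_subject = defaultdict(list)
    instances_by_class = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))
        if p == RDF_TYPE:
            instances_by_class[o].add(s)

    # Seed the queue by sampling instances across all classes.
    queue = deque()
    for cls in sorted(instances_by_class):
        members = sorted(instances_by_class[cls])
        queue.extend(rng.sample(members, min(seeds_per_class, len(members))))

    sample, visited = set(), set()
    while queue and len(sample) < target_size:
        resource = queue.popleft()
        if resource in visited:
            continue
        visited.add(resource)
        for triple in outgoing_description(by_subject, resource):
            if len(sample) >= target_size:
                break
            sample.add(triple)
            # Follow links to other described resources.
            if triple[2] in by_subject:
                queue.append(triple[2])
    return sample
```

Sampling one or more seeds per class is what keeps rare classes represented in the reduced dataset; a purely random triple sample would tend to lose them.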


DBSB Query Analysis (1)
• Goal is to detect prototypical queries that were sent to a DBPedia SPARQL endpoint using similarity measures
  – String similarity and graph similarity
• Idea: 4-step query analysis and clustering approach
  1. Select queries executed frequently on the input data
  2. Strip common syntactic constructs (namespace, prefixes)
  3. Compute query similarity using string matching
  4. Compute query clusters using a soft graph clustering algorithm []
• Clusters used to devise the benchmark query generation patterns

DBSB Query Analysis (2)
• Query Selection
  1. Use the DBPedia SPARQL query log (31.5 million queries in a 3-month period)
  2. Reduce the initial set of queries by considering
    • Query variations: use a standard way to name variables to reduce differences among queries (promoting query constructs such as DISTINCT, REGEX)
    • Query frequency: discard queries with low frequency since they do not contribute to the overall query performance
    – Result: 35,965 queries
  3. String stripping: remove all SPARQL keywords and common prefixes
  4. Similarity computation: compute the similarity of the stripped queries
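Variable normalization and string stripping could look like the sketch below. The keyword list, the regular expressions and the function names are assumptions made for illustration; the slides do not specify how the original tooling tokenizes queries.

```python
import re

# Assumed (incomplete) keyword list; the actual list used by DBSB is not given.
SPARQL_KEYWORDS = {
    "select", "distinct", "where", "filter", "optional", "union",
    "limit", "order", "by", "ask", "construct", "describe",
}


def normalize_variables(query):
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"?v{len(mapping)}"
        return mapping[name]

    return re.sub(r"[?$]\w+", rename, query)


def strip_query(query):
    """Drop PREFIX declarations and SPARQL keywords, keep the remaining tokens."""
    without_prefixes = re.sub(r"(?i)prefix\s+\w*:\s*<[^>]*>", " ", query)
    tokens = without_prefixes.split()
    return " ".join(t for t in tokens if t.lower() not in SPARQL_KEYWORDS)


q1 = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT DISTINCT ?name WHERE { ?p foaf:name ?name }"
q2 = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT DISTINCT ?n WHERE { ?x foaf:name ?n }"
print(strip_query(normalize_variables(q1)))
print(strip_query(normalize_variables(q2)))
```

Both example queries reduce to the same stripped string, so they would count as one query variation in the log.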


DBSB Query Analysis (3)
• Query Selection (cont’d)
  4. Similarity computation
    • To reduce the time of benchmark compilation, use the LIMES [NS11] framework
    • Use the Levenshtein string similarity measure with a 0.9 threshold
    • Reduce by 16.6% the number of computations required by computing the Cartesian product of queries
  5. Clustering
    • Apply graph clustering to the query similarity graph of (4)
    • Goal is to identify similar groups of queries out of which prototypical queries will be generated
    • Use the BorderFlow [NS09] algorithm, which follows a seed-based approach
    • Obtain 12,272 clusters, 24% of which contain a single query
    • Select the clusters with > 5 queries
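The similarity step can be illustrated with a naive all-pairs version. This sketch does not use LIMES or BorderFlow: it computes the full Cartesian product that LIMES is meant to avoid, and it assumes the similarity is the edit distance normalized by the longer string, which the slides do not state explicitly.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similarity_graph(queries, threshold=0.9):
    """Edges (i, j, score) between stripped queries at or above the threshold."""
    edges = []
    for (i, qa), (j, qb) in combinations(enumerate(queries), 2):
        score = similarity(qa, qb)
        if score >= threshold:
            edges.append((i, j, score))
    return edges
```

The resulting edge list is the kind of query similarity graph that a clustering algorithm would take as input in step 5.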


DBSB Query Generation (1)
• Select the most interesting SPARQL queries
  – Which are the most frequently asked SPARQL queries
  – Which of those queries cover the most SPARQL features
• SPARQL Features
  – Overall number of triple patterns
    • Test the efficiency of join operations (CP1)
  – SPARQL pattern constructors (UNION & OPTIONAL)
    • Handle parallel execution of UNIONs (CP5)
    • Perform OPTIONALs as late as possible in the query plan (CP3)
  – Solution sequences & modifiers (DISTINCT)
    • Efficiency of duplicate elimination (CP10)
  – Filter conditions and operators (FILTER, LANG, REGEX, STR)
    • Efficiency of engines to execute filters as early as possible (CP6)

DBSB Query Generation (2)
• 25 queries are selected
  – For each of the features, manually select the part of the query to be varied (IRI or filter condition)
  – Variability of query template(s) for the chosen values is sufficiently high (>= 1000 per query template)
• The method ensures that
  – Queries executed during the benchmark differ
  – Queries always return non-empty results

Waterloo SPARQL Diversity Test Suite [AHO+14]
• Stress existing RDF engines to reveal a wider range of query requirements as established by web applications
• Contributions
  – Definition of 2 classes of query features used to evaluate the variability of workloads and datasets
    • Structural (e.g., number of triple patterns)
    • Data-driven (affect selectivity and result cardinality)
  – In-depth analysis of existing SPARQL benchmarks using the structural and data-driven features
  – WatDiv Test Suite to stress existing RDF engines to reveal a wider range of query requirements

WatDiv Structural Features (1)
1. Triple Pattern Count
  – Number of triple patterns in SPARQL graph patterns
2. Join Vertex Count
  – Number of RDF terms (IRIs, literals, blank nodes) and variables that are subjects or objects of multiple triple patterns
3. Join Vertex Degree
  – The degree of a join vertex v is the number of triple patterns whose subject or object is v

SP2Bench Q5a
SELECT DISTINCT ?person ?name
WHERE { ?article rdf:type bench:Article .
        ?article dc:creator ?person .
        ?inproc rdf:type bench:Inproceedings .
        ?inproc dc:creator ?person2 .
        ?person foaf:name ?name .
        ?person2 foaf:name ?name2
        FILTER (?name = ?name2) }

Triple Count: 6
Join Vertices: ?article, ?inproc, ?person, ?person2, ?name, ?name2
Join Vertex Count: 10
Join Vertex Degree: ?article:2, ?inproc:2, ?person:2, ?person2:2, ?name:1, ?name2:1
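The three structural features can be computed directly from the triple patterns. The sketch below is an assumption-laden illustration: it counts only occurrences in triple patterns, so `?name` and `?name2`, which the slide lists among the join vertices because the FILTER equates them, end up with degree 1 and are not reported as join vertices here.

```python
from collections import Counter

# Triple patterns of SP2Bench Q5a, written as (subject, predicate, object).
Q5A = [
    ("?article", "rdf:type", "bench:Article"),
    ("?article", "dc:creator", "?person"),
    ("?inproc", "rdf:type", "bench:Inproceedings"),
    ("?inproc", "dc:creator", "?person2"),
    ("?person", "foaf:name", "?name"),
    ("?person2", "foaf:name", "?name2"),
]


def vertex_degrees(patterns):
    """Number of triple patterns in which each term is a subject or an object."""
    degrees = Counter()
    for s, _p, o in patterns:
        for term in {s, o}:
            degrees[term] += 1
    return degrees


def structural_features(patterns):
    """Triple pattern count, join vertex count and per-vertex degree."""
    degrees = vertex_degrees(patterns)
    join_vertices = {v: d for v, d in degrees.items() if d >= 2}
    return {
        "triple_pattern_count": len(patterns),
        "join_vertex_count": len(join_vertices),
        "join_vertex_degree": join_vertices,
    }


print(structural_features(Q5A))
```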


WatDiv Structural Features (2)
• Join Vertex Degree & Count provide a good characterization of the structural complexity of a query
  – The number of triple patterns does not properly characterize the query: two queries with the same set of triple patterns can have different structures

[Figure: three example query graphs with different structures: a linear query, a snowflake query and a star query.]

WatDiv Structural Features (3)
• Join Vertex Type
  – Plays an important role in the behavior of RDF engines when determining efficient query plans
    • E.g., star queries promote efficient merge joins
• 3 (mutually non-exclusive) types of join vertices
  – Vertex x is of type SS+ if for all triple patterns (s,p,o)*, x is the subject
  – Vertex x is of type OO+ if for all triple patterns (s,p,o)*, x is the object
  – Vertex x is of type SO+ if for triple patterns (s,p,o)*, (s’,p’,o’)*, x = s and x = o’
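A join vertex can be classified by looking at the role it plays in each incident triple pattern. The function below is a sketch with a hypothetical name and a simplified reading of the definitions: a vertex that is a subject in some incident patterns and an object in others is labeled SO+.

```python
def join_vertex_types(patterns):
    """Map each join vertex to SS+, OO+ or SO+ based on its roles."""
    roles = {}
    for s, _p, o in patterns:
        roles.setdefault(s, []).append("S")
        roles.setdefault(o, []).append("O")

    types = {}
    for vertex, seen in roles.items():
        if len(seen) < 2:
            continue  # not a join vertex
        if all(r == "S" for r in seen):
            types[vertex] = "SS+"
        elif all(r == "O" for r in seen):
            types[vertex] = "OO+"
        else:
            types[vertex] = "SO+"
    return types


# Hypothetical patterns: a star around ?m, two edges into ?x, a path through ?y.
star = [("?m", "A", "?n"), ("?m", "B", "?l"), ("?m", "C", "?k")]
sink = [("?b", "B", "?x"), ("?z", "C", "?x")]
path = [("?c", "D", "?y"), ("?y", "B", "?w")]
print(join_vertex_types(star), join_vertex_types(sink), join_vertex_types(path))
```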

[Figure: three example query graphs in which ?m is of type SS+, ?x is of type OO+, and ?x is of type SO+.]

* Triple patterns (s,p,o) are incident on x

WatDiv Data-driven Features (1)
• A system’s choice of the most efficient query plan depends on
  – (a) the characteristics of the dataset and
  – (b) the query
• If the system relies on selectivity estimations and result cardinality, the same query will have a different query plan for dataset(s) of different sizes
• Different cases:
  – Queries have a diverse mix of result cardinalities
  – Some triple patterns are very selective, others are not
  – All triple patterns are equally selective

WatDiv Data-driven Features (2)
• Result Cardinality $\mathrm{CARD}(\bar{A}, G)$
  – the number of solutions in the result of the evaluation of a graph pattern $\bar{A} = \langle A, F \rangle$ over graph $G$
• Filter Triple Pattern Selectivity (f-TP selectivity) $\mathrm{SEL}^{F}_{G}(tp)$
  – the ratio of distinct solution mappings of a triple pattern $tp$ to the set of triples in graph $G$
• Measures
  1. Result cardinality
  2. Mean & standard deviation of f-TP selectivities of triple patterns
    • Important for distinguishing queries whose triple patterns are almost equally selective from queries with varying f-TP selectivities
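The selectivity measure can be illustrated on a tiny in-memory graph. This is a sketch under simplifying assumptions: filters are ignored, the graph is a set of `(s, p, o)` string tuples, terms starting with `?` are variables, and the population standard deviation is used because the slides do not say which variant is meant.

```python
from statistics import mean, pstdev


def matches(tp, triple):
    """True if `triple` is a solution of triple pattern `tp`."""
    binding = {}
    for pattern_term, value in zip(tp, triple):
        if pattern_term.startswith("?"):
            # A repeated variable must bind to the same value.
            if binding.setdefault(pattern_term, value) != value:
                return False
        elif pattern_term != value:
            return False
    return True


def tp_selectivity(tp, graph):
    """Fraction of the triples in `graph` that match `tp`."""
    graph = set(graph)
    if not graph:
        return 0.0
    return sum(matches(tp, t) for t in graph) / len(graph)


G = {
    ("a", "type", "P"), ("b", "type", "P"),
    ("a", "name", "x"), ("b", "knows", "a"),
}
selectivities = [
    tp_selectivity(("?s", "type", "P"), G),
    tp_selectivity(("?s", "?p", "?o"), G),
]
print(selectivities, mean(selectivities), pstdev(selectivities))
```

Since each matching triple yields a different solution mapping for a single triple pattern, counting matching triples is equivalent to counting distinct mappings here.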


WatDiv Data-driven Features (3)
• Result cardinality & f-TP selectivity are not sufficient
  – Intermediate solution mappings will not make it to the final result (e.g., due to filters or more restrictive joins)
  – The overall selectivity of a graph pattern can be determined by a single very selective triple pattern
• Run-time optimization techniques (e.g., sideways information passing) prune intermediate results early
• Introduce 2 features to capture the above cases
  1. BGP-restricted f-TP selectivity
  2. Join-restricted f-TP selectivity

WatDiv Data-driven Features (4)
• BGP-restricted f-TP selectivity $\mathrm{SEL}^{F}_{G}(tp \mid \bar{A})$
  – assesses how much a triple pattern contributes to the overall selectiveness of the query
  – Fraction of distinct solution mappings for a triple pattern that are compatible with some solution mapping in the query result.
• Join-restricted f-TP selectivity $\mathrm{SEL}^{F}_{G}(tp \mid x)$
  – assesses how much a filtered triple pattern contributes to the overall selectiveness of the joins that it participates in
  – for $x$ a join vertex and $tp$ a triple pattern incident on $x$, the $x$-restricted f-TP selectivity of $tp$ over graph $G$ is the fraction of distinct solution mappings compatible with a solution mapping in the query result of the sub-query that contains all triple patterns incident on $x$

WatDiv Test Suite (1)
•  Components: Data Generator and Query Generator
•  Data Generator
   –  Allows users to define their own dataset, controlling
      •  Entities to include
      •  Topology of the graphs, allowing one to mimic the real types of data distributions in the Web
         – «well-structuredness» of entities
         – probability of entity associations
         – cardinality of property associations
   –  Important: instances of the same entity do not have the same set of attributes, breaking the «relational nature» of previous RDF benchmarks


WatDiv Test Suite (2)
•  Query Template Generator
   –  User-specified number of templates
   –  User-specified template characteristics
      •  Number of triple patterns
      •  Types of joins and filters in the triple patterns
   –  Traverses the WatDiv schema using a random walk and generates a set of query templates
•  Query Generator
   –  Instantiates the query templates with terms (IRIs, literals, etc.) from the RDF dataset
   –  User-specified number of queries produced


WatDiv Test Suite (3)
•  Query Template Generator
   –  Random walk on an internal representation of the schema
      •  Entity types in the schema correspond to graph vertices
      •  Relationships (i.e., object type properties) are graph edges
      •  Vertices are annotated with datatype properties (i.e., attributes)
   –  Produces a set of Basic Graph Patterns with a maximum of n triple patterns with unbound objects and subjects
   –  k uniformly randomly selected subjects/objects are replaced with placeholders
   –  Placeholders are replaced with actual RDF terms randomly retrieved from the dataset
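The template generation steps above can be sketched roughly as follows. This is a simplified illustration, not the WatDiv code: the schema, the instance lists and all function names are hypothetical, the walk only follows edges in their forward direction, and datatype properties and filters are left out.

```python
import random

# Hypothetical toy schema: entity type -> list of (property, target entity type).
SCHEMA = {
    "User": [("likes", "Product"), ("follows", "User")],
    "Product": [("soldBy", "Retailer")],
    "Retailer": [],
}
# Hypothetical instances per entity type, standing in for the generated dataset.
INSTANCES = {
    "User": ["user0", "user1"],
    "Product": ["product0"],
    "Retailer": ["retailer0"],
}

def random_walk_template(schema, n, k, rng):
    """Walk the schema for at most n steps and pick k placeholder variables.

    Returns (patterns, placeholders): patterns is a list of triple patterns with
    unbound subjects and objects, placeholders maps chosen variables to types.
    """
    starts = sorted(t for t in schema if schema[t])  # types with outgoing edges
    current = rng.choice(starts)
    subject = "?v0"
    var_types = {subject: current}
    patterns = []
    for step in range(1, n + 1):
        edges = schema[current]
        if not edges:  # dead end: stop the walk early
            break
        prop, target = rng.choice(edges)
        obj = f"?v{step}"
        patterns.append((subject, prop, obj))
        var_types[obj] = target
        subject, current = obj, target
    variables = sorted(var_types)
    chosen = rng.sample(variables, min(k, len(variables)))
    return patterns, {v: var_types[v] for v in chosen}

def instantiate(patterns, placeholders, instances, rng):
    """Replace each placeholder with a randomly chosen term of the right type."""
    binding = {v: rng.choice(instances[t]) for v, t in placeholders.items()}
    return [tuple(binding.get(term, term) for term in tp) for tp in patterns]
```

A call such as `random_walk_template(SCHEMA, 3, 1, random.Random(0))` yields one template; calling `instantiate` repeatedly on it produces the user-specified number of concrete queries.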


Comparison of WatDiv with other RDF Benchmarks

(Figure: Copyright [AHO+14])

•  Query Workload
   –  Large range of queries
      •  Mean join vertex degree distributed between 2 and 10
   –  Join Vertex Types:
      •  18% of queries are star joins, 4.4% in DBSB
      •  61.3% of queries are path queries, 5.4% in DBSB


Comparison of WatDiv with other RDF Benchmarks

(Figure: Copyright [AHO+14])

•  Data-Driven Features
   –  DBSB and BSBM cover the ends of the spectrum of mean Join-Restricted f-TP selectivity values
   –  WatDiv covers the full spectrum of Restricted f-TP selectivity values
   –  WatDiv covers a lower range of values for mean f-TP selectivity when compared to DBSB

General Remarks
•  comparable to DBSB
•  more diverse than LUBM, SP2Bench and BSBM


FEASIBLE [SMN15]

•  Proposes a feature-based benchmark generation approach from real queries
   –  Structure-based
   –  Data-driven
•  Approach is similar to the WatDiv Test Suite
•  Novel sampling approach for queries based on exemplars and medoids
•  Proposes SELECT, ASK, CONSTRUCT and DESCRIBE SPARQL queries


FEASIBLE Query Features
•  Number of Triple Patterns
•  Number of Join Vertices
   –  Distinguishing between «star», «path», «hybrid» and «sink» vertices
•  Join Vertex Degree
   –  Sum of incoming and outgoing edges of the vertex
•  Triple Pattern Selectivity
   –  Ratio of triples that match the triple pattern over all triples in the dataset


(Figure: example query graphs illustrating the join vertex types: star vertex x, path vertex y, hybrid vertex x, sink vertex x)
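The features on this slide can be computed directly from a list of triple patterns. The sketch below is illustrative only: the function names are hypothetical, and the rules used to tell the four vertex types apart (star: only outgoing edges, sink: only incoming edges, path: exactly one of each, hybrid: anything else) are an assumption based on the figure, not a quotation from [SMN15].

```python
from collections import defaultdict

def is_var(term):
    return term.startswith("?")

def triple_pattern_selectivity(tp, dataset):
    """Ratio of triples matching tp over all triples (repeated variables ignored)."""
    matches = sum(
        all(is_var(pat) or pat == val for pat, val in zip(tp, triple))
        for triple in dataset
    )
    return matches / len(dataset)

def join_vertices(patterns):
    """Map each join vertex to (degree, type); degree = incoming + outgoing edges."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for s, _p, o in patterns:
        out_deg[s] += 1   # the subject has an outgoing edge
        in_deg[o] += 1    # the object has an incoming edge
    result = {}
    for v in set(out_deg) | set(in_deg):
        incoming, outgoing = in_deg[v], out_deg[v]
        if incoming + outgoing < 2:
            continue  # used by a single triple pattern: not a join vertex
        if incoming == 0:
            kind = "star"
        elif outgoing == 0:
            kind = "sink"
        elif incoming == 1 and outgoing == 1:
            kind = "path"
        else:
            kind = "hybrid"
        result[v] = (incoming + outgoing, kind)
    return result

STAR = [("?x", "p1", "?o1"), ("?x", "p2", "?o2")]
PATH = [("?x", "p1", "?y"), ("?y", "p2", "?z")]
SINK = [("?y", "p1", "?x"), ("?z", "p2", "?x")]
HYBRID = [("?o1", "p1", "?x"), ("?x", "p2", "?y"), ("?x", "p3", "?z")]
```

For example, `join_vertices(HYBRID)` reports `?x` with degree 3 and type `hybrid`, since it has one incoming and two outgoing edges.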


FEASIBLE Benchmark Generation
•  3-step benchmark generation
•  Dataset Cleaning
   –  Leads to practically reliable benchmarks
•  Normalization of Feature Vectors
   –  Query selection process requires distances between queries to be computed
   –  Normalize the query representations so that all queries are in a unit hypercube
•  Query Selection
   –  Based on the idea of exemplars [NS11]


FEASIBLE Benchmark Generation

•  Dataset Cleaning
   –  Remove erroneous and zero-result queries from the set of real queries used to generate the benchmark
   –  Exclude all syntactically incorrect queries
   –  Queries that lead to runtime errors are removed
   –  Attach 9 SPARQL operators (UNION, DISTINCT, OPTIONAL, ...) and 7 query features (join vertices, join vertex count, etc.) to each of the queries


FEASIBLE Benchmark Generation

•  Normalization of Feature Vectors
   –  Queries are mapped to a vector of length 16 which stores the query features
      •  For binary SPARQL clauses (e.g., UNION is either used or not used), store value 1 if the clause is used, else store value 0
      •  All non-binary features are normalized by dividing their value by the overall maximum value in the dataset
      •  Query representations are thus associated with values between 0 and 1
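The normalization step can be sketched in a few lines. This is an illustrative sketch with a hypothetical function name and a 3-dimensional toy vector instead of the 16 features; the handling of a feature whose maximum is 0 is an assumption added to avoid a division by zero.

```python
def normalize(vectors, binary_idx):
    """Scale feature vectors into the unit hypercube.

    vectors: list of equal-length lists of numbers, one per query.
    binary_idx: set of positions holding binary features (clause used or not).
    """
    dims = len(vectors[0])
    maxima = [max(v[d] for v in vectors) for d in range(dims)]
    normalized = []
    for v in vectors:
        row = []
        for d, value in enumerate(v):
            if d in binary_idx:
                row.append(1.0 if value else 0.0)
            elif maxima[d] == 0:
                row.append(0.0)  # assumption: a feature that is 0 everywhere stays 0
            else:
                row.append(value / maxima[d])
        normalized.append(row)
    return normalized

# Hypothetical toy vectors: [uses UNION, number of triple patterns, result size]
RAW = [[1, 4, 10], [0, 2, 5], [1, 0, 0]]
UNIT = normalize(RAW, binary_idx={0})
```

After this step every coordinate lies between 0 and 1, so distances between queries are not dominated by features with large raw values such as result size.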


FEASIBLE Benchmark Generation
•  Query Selection
   –  Given a number N of queries to select as benchmark queries
   –  A set L of cleaned and normalized queries, |L| ≫ N
   –  Compute an N-size partition of L such that
      •  The average distance between two points in 2 different elements of the partition is high, and
      •  The average distance between points within an element of the partition is small
      •  Select the point that is closest to the average of each element of the partition and include it in the benchmark
   –  Implemented by
      •  Selecting exemplars (points that represent a portion of the space) that are as far as possible from each other
      •  Partitioning L by mapping every point of L to one of these exemplars to compute a partition of the space
      •  Selecting the medoid of each of the space partitions as a query in the benchmark
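The three implementation steps can be sketched as below. This is one plausible reading rather than the FEASIBLE code: the function names are hypothetical, exemplars are chosen greedily (first the point closest to the overall average, then repeatedly the point with the largest total distance to the exemplars chosen so far), every point is mapped to its nearest exemplar, and n is assumed not to exceed the number of distinct points.

```python
import math

def mean_point(points):
    """Coordinate-wise average of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def select_exemplars(points, n):
    """Greedily pick n points that are far apart from each other."""
    centre = mean_point(points)
    exemplars = [min(points, key=lambda p: math.dist(p, centre))]
    while len(exemplars) < n:
        candidates = [p for p in points if p not in exemplars]
        exemplars.append(
            max(candidates, key=lambda p: sum(math.dist(p, e) for e in exemplars))
        )
    return exemplars

def partition(points, exemplars):
    """Assign every point to its nearest exemplar."""
    clusters = [[] for _ in exemplars]
    for p in points:
        nearest = min(range(len(exemplars)),
                      key=lambda i: math.dist(p, exemplars[i]))
        clusters[nearest].append(p)
    return clusters

def medoid(cluster):
    """The cluster member closest to the cluster average."""
    centre = mean_point(cluster)
    return min(cluster, key=lambda p: math.dist(p, centre))

def select_benchmark(points, n):
    """Return n representative points (queries) from the normalized set."""
    exemplars = select_exemplars(points, n)
    return [medoid(c) for c in partition(points, exemplars)]

# Hypothetical normalized 2-D feature vectors forming two obvious groups.
QUERIES = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
```

With these points, `select_benchmark(QUERIES, 2)` returns one query per group, namely `(0.0, 0.0)` and `(1.0, 1.0)` in some order.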


Apples and Oranges [DKS+11]
•  Propose structuredness to characterize datasets
   –  The level of structuredness of a dataset D, with respect to a type (class) T, is determined by how well the instances of T conform to type T
   –  If each instance of T has the properties defined in T, then the dataset has high structuredness with respect to T


(Figure: two bar charts of OC(p, I(T,D)) for each property p of T: name, office, ext, major, GPA)

First chart:
•  all instances have the name attribute
•  ext & GPA properties encountered in 50% of the instances
•  office property found in 20% of the instances
•  major property in 10% of the instances

Second chart: highly structured dataset
•  all instances have all attributes


Apples and Oranges [DKS+11]
•  Structuredness is one of the key considerations when deciding on:
   –  appropriate data representation format (e.g., relational for structured and XML for semi-structured data)
   –  organization of data (e.g., dependency theory and normal forms for the relational model, and XML)
   –  data indexes (e.g., B+-tree indexes for relational and numbering scheme-based indexes for XML)
   –  data querying (e.g., using SQL for the relational model and XPath/XQuery for XML)

In other words, structuredness permeates every aspect of data management


Apples and Oranges [DKS+11]

(Figure: structuredness of synthetic datasets and real datasets on an axis from 0.0 to 1.0, ranging between less structured datasets and highly structured (relational-like) datasets)


Apples and Oranges [DKS+11]

Some important observations:
•  Since TPC-H is a relational dataset, it should have high structuredness.
•  There is a difference between synthetic and real datasets.
•  Synthetic datasets are fairly structured and relational-like.
•  Real datasets cover the whole spectrum of structuredness.

(Figure: structuredness of datasets, axis from 0.0 to 1.0)

Existing RDF stores are tested and compared against each other with respect to datasets that are not representative of most real RDF data.

Can we design a benchmark that generates data which more accurately represents RDF data?


Apples and Oranges [DKS+11]
•  Nothing can better represent data than the data itself!
•  Idea: Turn every dataset into a benchmark
1.  No need to synthetically generate values
    •  Use the actual data values in the dataset
2.  No need to synthetically generate queries
    •  The queries that are known to run on your data can be used in the benchmark
3.  But we need to cover the structuredness spectrum
    •  to get data as close as possible to the real-world data
    •  to see how the systems perform when data goes from very structured to less structured


Counting Coins [DKS+11]

•  Start with a dataset with size S and CH = 0.5
•  Aim for a dataset with size S’ and CH’, where S > S’ and CH > CH’.

Process:
•  Assign a coin to each triple (s, p, o) and compute the impact on CH of its removal
   –  The removal reduces the size by 1. Example: Consider (person1, ext, x5304). Removing the triple from D gives a dataset with CH(T, D) = 0.467. Therefore coin(person1, ext, x5304) = 0.5 - 0.467 = 0.033.
•  Formulate (automatically) an integer programming problem whose solutions will tell us how many coins to remove to achieve the desired coherence CH’ and size S’.


subject   predicate   object
person0   name        Eric
person0   office      BA7430
person0   ext         x4401
person1   name        Kenny
person1   office      BA7349
person1   office      BA5439
person1   ext         x5304
person2   name        Kyle
person2   ext         x6281
person3   name        Timmy
person3   major       C.S.
person3   GPA         3.4
person4   name        Stan
person4   GPA         3.8
person5   name        Jimmy
person5   GPA         3.7

One of the few occasions in life where having too many coins is undesirable…
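The numbers in the example can be reproduced from the table. The sketch below assumes a single type T with the five properties shown, and takes CH(T, D) to be the fraction of (instance, property) combinations that are actually set, i.e. the sum over p of OC(p, I(T,D)) divided by |P(T)| × |I(T,D)|. That formula is an assumption that happens to match the slide's figures (15/30 = 0.5); the function names are hypothetical.

```python
PROPERTIES = ["name", "office", "ext", "major", "GPA"]

TRIPLES = [
    ("person0", "name", "Eric"), ("person0", "office", "BA7430"),
    ("person0", "ext", "x4401"),
    ("person1", "name", "Kenny"), ("person1", "office", "BA7349"),
    ("person1", "office", "BA5439"), ("person1", "ext", "x5304"),
    ("person2", "name", "Kyle"), ("person2", "ext", "x6281"),
    ("person3", "name", "Timmy"), ("person3", "major", "C.S."),
    ("person3", "GPA", "3.4"),
    ("person4", "name", "Stan"), ("person4", "GPA", "3.8"),
    ("person5", "name", "Jimmy"), ("person5", "GPA", "3.7"),
]

def coherence(triples, properties):
    """Fraction of (instance, property) slots that are filled at least once."""
    instances = {s for s, _p, _o in triples}
    filled = {(s, p) for s, p, _o in triples if p in properties}
    return len(filled) / (len(properties) * len(instances))

def coin(triple, triples, properties):
    """Drop in coherence caused by removing a single triple."""
    remaining = [t for t in triples if t != triple]
    return coherence(triples, properties) - coherence(remaining, properties)

ch = coherence(TRIPLES, PROPERTIES)
ext_coin = coin(("person1", "ext", "x5304"), TRIPLES, PROPERTIES)
office_coin = coin(("person1", "office", "BA7349"), TRIPLES, PROPERTIES)
```

Here `ch` is 0.5 and `ext_coin` rounds to 0.033, as on the slide. `office_coin` is 0 because person1 keeps a second office value, which hints at why multi-valued properties need special treatment in the problem formulation.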


Technical challenges in problem formulation
•  Compute coins which represent the impact on structuredness of removing all triples with subjects that are instances of a type T and with property equal to p
   –  Therefore one coin for each type/property combination.
•  Add constraints that set lower and upper bounds on the number of coins that can be removed so as not to completely remove a property from a type.
•  Add constraints which guarantee that not all instances of a type are removed.
•  To deal with multi-valued properties, add constraints that introduce a relaxation parameter ρ
   –  required because of the approximation introduced by using the average number of triples per coin.


Overview
•  Introducing Benchmarks
•  A short discussion about Linked Data
   –  Resource Description Framework (Data Model)
   –  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
   –  Synthetic
   –  Real
   –  Benchmark Generators
•  Sum up: what did we learn today?


References
•  [NPM+12] R. Othayoth Nambiar, M. Poess, A. Masland, H. R. Taheri, M. Emmerton, F. Carman, and M. Majdalany. TPC Benchmark Roadmap 2012. In TPCTC, 2012.
•  [L97] Charles Levine. TPC-C: The OLTP Benchmark. In SIGMOD - Industrial Session, 1997.
•  [Castro04] R. Garcia-Castro and A. Gomez-Perez. A Method for Performing an Exhaustive Evaluation of RDF(S) Importers. In WISE 2005 Workshops, 2005.
•  [W90] R. P. Weicker. An overview of common benchmarks. Computer, 23(12):65–75, December 1990.
•  [G93] J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems (2nd Edition). Morgan Kaufmann, 1993.
•  [H09] K. Huppler. The Art of Building a Good Benchmark. In TPCTC, 2009.
•  [GPH05] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):158-182, October 2005.


References
•  [BNE14] P. Boncz, T. Neumann, O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. Performance Characterization and Benchmarking. In TPCTC 2013, Revised Selected Papers.
•  [SHM+09] M. Schmidt, T. Hornung, M. Meier, C. Pinkel, G. Lausen. SP2Bench: A SPARQL Performance Benchmark. Semantic Web Information Management, 2009.
•  [FOAF] Friend of A Friend. http://www.foaf-project.org/
•  [SWRC] SWRC Ontology. http://ontoware.org/swrc/
•  [DC] Dublin Core Metadata Initiative. http://dublincore.org/
•  [BS09] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web and Inf. Sys., 5(2), 2009.
•  [BSBM] Berlin SPARQL Benchmark (BSBM) Specification - V3.1. http://wifo5-03.informatik.unimannheim.de/bizer/berlinsparqlbenchmark/spec/index.html.
•  [RU09] N. Redaschi and UniProt Consortium. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. In Biocuration Conference, 2009.


References
•  [UniProtKB] UniProtKB Queries. http://www.uniprot.org/help/query-fields.
•  [NW10] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 2010.
•  [NW09] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In SIGMOD 2009.
•  [SKW07] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW 2007.
•  [Barton] The MIT Barton Library dataset. http://simile.mit.edu/rdf-test-data/
•  [PHS10] H. Patni, C. Henson, and A. Sheth. Linked sensor data. 2010.
•  [TPC-H] The TPC-H Homepage. http://www.tpc.org/tpch/
•  [WordNet] WordNet: A lexical database for English. http://wordnet.princeton.edu/
•  [XMark] An XML Benchmark Project. http://www.xml-benchmark.org/
•  [MLA+14] M. Morsey, J. Lehmann, S. Auer, A-C. Ngonga Ngomo. DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. ISWC, 2011.


References
•  [NS11] A-C. Ngonga Ngomo, S. Auer. LIMES: a time-efficient approach for large-scale link discovery on the web of data. IJCAI'11.
•  [NS09] A-C. Ngonga Ngomo and D. Schumacher. Borderflow: A local graph clustering algorithm for natural language processing. In CICLing, 2009.
•  [AHO+14] G. Aluc, O. Hartig, T. Ozsu, K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In ISWC, 2014.
•  [SMN15] M. Saleem, Q. Mehmood, and A-C. Ngonga Ngomo. FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework. ISWC 2015.
•  [DKS+11] S. Duan, A. Kementsietsidis, Kavitha Srinivas and Octavian Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.
•  [Wilkinson06] K. Wilkinson. Jena property table implementation. In SSWS, 2006.
•  [AMM+07] Daniel J. Abadi, Adam Marcus, Samuel Madden, Katherine J. Hollenbach: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007: 411-42
•  [BDK+13] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, Bishwaranjan Bhattacharjee: Building an efficient RDF store over a relational database. SIGMOD Conference 2013: 121-132
