SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data
Institute for Web Science and Technologies
University of Koblenz-Landau, Germany
Systematic Generation of SPARQL Benchmark Queries for Linked Open Data
Olaf Görlitz, Matthias Thimm, Steffen Staab
ISWC'12, Boston, 11/15/2012
SPLODGE: Systematic LOD Benchmark Query Generation
Olaf Görlitz, Matthias Thimm, Steffen Staab
Linked Data Federation
SPARQL Queries on the Linked Data Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
The Problem
distributed queries
federation implementation
Why not use benchmark queries?
RDF Benchmarks
LUBM, BSBM, SP²B, ...
• Synthetic datasets
• Domain-specific
• Highly structured
• Sophisticated queries
FedBench (ISWC'11)
• 10 Linked Data sets (~170M triples)
• 25 handpicked distributed queries
Centralized Fixed
Scalable, Flexible, Expressive Linked Data Benchmark
Overview
Benchmark Idea, Methodology, Evaluation
Linked Data Benchmark Features
Scalability: Real Linked Data Sets
Flexibility: Customization
Expressiveness: Typical + Complex Queries
Systematic SPARQL Benchmark Query Generator for Linked Open Data
Requirements
1. Define Query Characteristics
2. Automatic Query Generation
3. Query Validation
What we want:
1. Customize Benchmark
2. Random Queries
3. #results > 0
Contribution
Methodology and toolset for systematic query generation
Parameterization → Query Generation → Query Validation
Input: Linked Data, Config
Output: Benchmark Queries
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Define typical + challenging distributed queries
No federation query logs are available, so the queries of existing benchmarks are analyzed instead
SELECT ?drug ?keggUrl ?chebiImage WHERE {
  ?drug rdf:type drugbank:drugs .
  ?drug drugbank:keggCompoundId ?keggDrug .
  ?keggDrug bio2rdf:url ?keggUrl .
  ?drug drugbank:genericName ?drugBankName .
  ?chebiDrug purl:title ?drugBankName .
  ?chebiDrug chebi:image ?chebiImage .
}
FedBench/LifeScience#5
Algebra
• Query Form (Select, Construct, ...)
• Join Type (conj. / disj. / left-join)
• Result Modifiers (limit, offset, order by)
Structure
• Variable Patterns (s, o, s+o, ...)
• Join Patterns (star, path)
• Cross Product
Cardinality
• # Data Sources
• # Joins / Patterns
• # Results
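The three parameter groups (algebra, structure, cardinality) could be held in a small configuration object. The following is only a sketch: the class name `QueryConfig`, its field names, and its defaults are hypothetical and do not reflect SPLODGE's actual configuration format.

```python
from dataclasses import dataclass, field
from enum import Enum


class JoinPattern(Enum):
    STAR = "star"
    PATH = "path"


@dataclass
class QueryConfig:
    """Hypothetical container for the parameter groups named on the slide."""
    # Algebra
    query_form: str = "SELECT"
    join_type: str = "conjunctive"
    result_modifiers: dict = field(default_factory=dict)  # e.g. {"limit": 10}
    # Structure
    join_pattern: JoinPattern = JoinPattern.PATH
    allow_cross_product: bool = False
    # Cardinality
    num_sources: int = 2
    num_patterns: int = 3
    min_results: int = 1

    def validate(self) -> None:
        """Reject parameter combinations that cannot describe a query."""
        if self.num_patterns < 1:
            raise ValueError("need at least one triple pattern")
        # Assumption: each pattern is answered by one source, so a query
        # with n patterns can span at most n sources.
        if self.num_sources > self.num_patterns:
            raise ValueError("num_sources must not exceed num_patterns")
```

Keeping the checks in one `validate` method means a generator can refuse an inconsistent benchmark configuration before any query is produced.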
Main query parameter: join structure
FedBench queries: star join, path join
Path-join: n triple patterns, m sources (m ≤ n)
Star-join: n triple patterns, anchor node (s/o)
Additional query parameters:
• # triple patterns
• # data sources
• result size
• ...
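To make the two join structures concrete, here is a minimal sketch that renders a list of predicate IRIs as either a path-join or a star-join query. The function names and the variable naming scheme (`?var1`, `?anchor`) are illustrative choices, not part of the SPLODGE toolset.

```python
def path_join_query(predicates):
    """Chain patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"  ?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def star_join_query(predicates, anchor="s"):
    """All patterns share one anchor variable, as subject ('s') or object ('o')."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"  ?var{i} <{p}> ?anchor .")
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"
```

A path join over n predicates uses n + 1 variables, while a star join uses the anchor plus one variable per pattern.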
Query Generation
Iteratively add random triple patterns
Need background knowledge (level of detail?)
#results > 0 ?
Predicate combinations (how provided?)
owl:sameAs, rdf:type, rdfs:label, foaf:knows
Linked Predicates
(owl:sameAs → rdf:type)
  DBpedia → geonames (43, 58)
  freebase → DBpedia (86, 72)
  ...
Characteristic Sets*
{rdfs:label, foaf:knows, …}
  DBpedia (322), rdfs:label (437), foaf:knows (322)
  ...
*[Neumann, Moerkotte, ICDE 2011]
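Both kinds of statistics can be gathered in one pass over the quads. The sketch below assumes quads are `(subject, predicate, object, source)` tuples held in memory. The slide does not say exactly which counts SPLODGE stores, so the counters here are an illustration rather than the tool's data model.

```python
from collections import Counter, defaultdict


def linked_predicates(quads):
    """Count (p1, src1) -> (p2, src2) links, i.e. cases where the object
    of a p1 triple is the subject of a p2 triple."""
    out_by_subject = defaultdict(list)
    for s, p, o, src in quads:
        out_by_subject[s].append((p, src))
    links = Counter()
    for s, p1, o, src1 in quads:
        for p2, src2 in out_by_subject.get(o, []):
            links[(p1, src1, p2, src2)] += 1
    return links


def characteristic_sets(quads):
    """Count, per source, how many subjects carry each set of predicates."""
    preds = defaultdict(set)
    for s, p, o, src in quads:
        preds[(src, s)].add(p)
    return Counter((src, frozenset(ps)) for (src, s), ps in preds.items())
```

Linked predicates describe which predicates can follow each other in a path join. Characteristic sets describe which predicates co-occur on one subject, which is what a star join needs.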
Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi)
Characteristic Sets: {p1, p4}, {p1, p4, ...}
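One way to read the chaining of linked predicates is as a random walk: start at some predicate and repeatedly step to a predicate that is known to follow it. The sketch below assumes the linked-predicate statistics have been reduced to a successor map. `random_path` is a hypothetical helper, not the algorithm as implemented in SPLODGE.

```python
import random


def random_path(links, length, rng=random):
    """Pick `length` predicates such that each consecutive pair is linked.

    `links` maps a predicate to the set of predicates that can follow it,
    i.e. the (p_i -> p_j) pairs from the linked-predicate statistics.
    Returns None if the walk reaches a predicate without successors.
    """
    current = rng.choice(sorted(links))
    path = [current]
    while len(path) < length:
        successors = links.get(current)
        if not successors:
            return None  # dead end: the caller may simply retry
        current = rng.choice(sorted(successors))
        path.append(current)
    return path
```

Passing a seeded `random.Random` instance as `rng` makes a generated batch reproducible.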
Query Validation
Verify generated queries (#results > 0)
How to evaluate? Compute confidence value
minimum join selectivity > ε
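The slide does not give the formula behind the confidence value. As a stand-in, the sketch below uses a textbook independence estimate: join selectivity is the number of links divided by the size of the cross product, and a path is rejected as soon as one join does not exceed the threshold. All function and parameter names are hypothetical.

```python
def join_selectivity(link_count, card_left, card_right):
    """Fraction of the cross product that survives the join."""
    return link_count / (card_left * card_right)


def check_path(path, pred_count, link_count, min_selectivity):
    """Return (accepted, estimated_size) for a path of predicates.

    Assumes independent joins, so the estimate is the product of all
    pattern cardinalities and all pairwise join selectivities.
    Predicate counts are expected to be >= 1 for every predicate used.
    """
    estimate = float(pred_count[path[0]])
    for left, right in zip(path, path[1:]):
        sel = join_selectivity(
            link_count.get((left, right), 0),
            pred_count[left],
            pred_count[right],
        )
        if sel <= min_selectivity:
            return False, None  # too selective: likely an empty result
        estimate *= pred_count[right] * sel
    return True, estimate
```

A query accepted this way still has to be executed to confirm that it returns results. The estimate only filters out candidates that are unlikely to.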
Evaluation Objective
• Verify generation of valid queries (#results > 0)
• Compare variations of query generation algorithms
Metrics:
• # queries with non-empty results
• # results per query
Algorithms:
• Baseline: "random" predicates
• SPLODGElite: background knowledge
• SPLODGE: background knowledge + minimum join selectivity (> 10^-4 / 10^-3 / 10^-2)
Evaluation Setup
Real Linked Data: Billion Triple Challenge Dataset
Random queries:
• Path-joins across data sources
• 3-6 patterns, bound predicates
• 100 queries per batch
Triple Store: RDF3X
Example query:
SELECT * WHERE {
  ?var1 <http://dbpedia.org/property/description> ?var2 .
  ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
  ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
  ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
  ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
}
Evaluation Results
[Plot: #queries by number of joined triple patterns]
[Plot: #results by number of joined triple patterns]
Estimated vs. actual result size
[Plot: actual result size against estimated result size]
Predicate Occurrence in Queries
Conclusion
SPLODGE provides
• Flexible query characterization + parameterization
• Methodology for Systematic & Scalable Query Generation
• Toolset as Open Source (http://code.google.com/p/splodge/)
Future Work:
• Create a LOD Federation Benchmark
• Interactive SPARQL query construction
Questions?
SPLODGE Evaluation Setup
BTC 2011 dataset in RDF3X
• pure triples, no context
• 160 GB repository file (14h loading, 200 GB tmp mem)
SPLODGE Pre-Processing for BTC data
• Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h, 17 GB gzip
• Replace quad context (reduce number of sources): 4.4 h
• Sort quads + remove duplicates: 8.5 h
• Build predicate/context dictionary: 1.0 h, <1 MB gzip
• Create resource in/out-link index: 9.7 h, 1.7 GB gzip
• Create linked predicate stats, compute characteristic sets: 1.6 h
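The context replacement and deduplication steps can be sketched in a few lines for data that fits in memory. How SPLODGE maps a context to its domain is not stated on the slide. Taking the URL's host name is an assumption, and the function names are hypothetical. At BTC scale an external sort would replace the in-memory set.

```python
from urllib.parse import urlsplit


def context_to_source(context_url):
    """Reduce a quad's context URL to its host name, e.g.
    http://jane08.lifejournal.com/home -> jane08.lifejournal.com"""
    host = urlsplit(context_url).hostname
    return host or context_url  # keep the input if no host can be parsed


def preprocess(quads):
    """Replace contexts by sources, then sort and drop duplicate quads."""
    replaced = {(s, p, o, context_to_source(c)) for s, p, o, c in quads}
    return sorted(replaced)
```

Collapsing contexts first matters for the later steps: two quads that differ only in the page they were crawled from become duplicates and are removed before any statistics are computed.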