RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur...
RDF-3X : a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum
Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik
PVLDB '08
May 25, 2011
Presented by Somin Kim
Outline
Introduction
Background and State of the Art
Storage and Indexing
Query Processing and Optimization
Selectivity Estimates
Evaluation
Conclusion
Introduction (1/3)
Motivation and Problem
RDF (Resource Description Framework)
– A flexible representation of schema-free information for the semantic web
– (subject, predicate, object) or (subject, property, value)
– All RDF triples together can be viewed as a large graph
– The notion of RDF triples fits well with the "pay as you go" philosophy
Introduction (2/3)
Motivation and Problem
Technical challenges for managing large-scale RDF data
– Physical database design
– Prediction of join attributes
– Suitable granularity of statistics gathering
– RDF triples form a graph rather than a collection of trees
Introduction (3/3)
Contribution and Outline
RDF-3X (RDF Triple eXpress)
– A novel architecture for RDF indexing and querying, eliminating the need for physical database design
Key principles of RDF-3X
– Physical design is workload-independent: appropriate indexes are created over a single, giant "triple table"
– The query processor is RISC-style: it relies mostly on merge joins over sorted index lists
– The query optimizer employs dynamic programming for plan enumeration
Outline
Introduction
Background and State of the Art
– SPARQL
– Related Work
Storage and Indexing
Query Processing and Optimization
Selectivity Estimates
Evaluation
Conclusion
Background and State of the Art (1/4)
SPARQL
The official standard for querying RDF stores
Each triple pattern consists of S, P, and O, each of which is either a variable or a literal
Two query modifiers of SPARQL
– distinct keyword: duplicates must be eliminated
– reduced keyword: duplicates may, but need not, be eliminated

SELECT ?var1 ?var2 …
WHERE { pattern1. pattern2. … }

SELECT ?title
WHERE {
  ?m <hasTitle> ?title; <hasCasting> ?c.
  ?c <Actor> ?a.
  ?a <hasName> "Johnny Depp" }
Background and State of the Art (2/4)
Related Work
Triple table
– All triples are stored in a single table

SELECT ?title
WHERE {
  ?book <title> ?title.
  ?book <author> <Fox, Joe>.
  ?book <copyright> <2001>
}
Based on JS Myoung's presentation slide
Background and State of the Art (3/4)
Related Work
Property table
– Triples are grouped by their predicate name
[Figure labels: subject, property, object]
Background and State of the Art (4/4)
Related Work
Cluster-property table
– Triples are clustered by properties that tend to be defined together
Outline
Introduction
Background and State of the Art
Storage and Indexing
– Triple Store and Dictionary
– Compressed Indexes
– Aggregated Indexes
Query Processing and Optimization
Selectivity Estimates
Evaluation
Conclusion
Storage and Indexing (1/7)
Triples Store and Dictionary
RDF-3X is based on a single, giant "triples table"
Mapping Dictionary
– All literals are replaced by ids using a mapping dictionary
– This compresses the triple store, which then contains only id triples

S          P          O
object214  hasColor   blue
object214  belongsTo  object352
…          …          …

S  P  O
0  1  2
0  3  4
…  …  …

ID  Value
0   object214
1   hasColor
…   …
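The dictionary mapping above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: ids are handed out in first-seen order, and the function name `encode_triples` is hypothetical, not something taken from RDF-3X.

```python
def encode_triples(triples):
    """Replace every string by a dense integer id (assigned in first-seen order)."""
    ids = {}       # string -> id
    values = []    # id -> string, used later to decode query results
    encoded = []
    for triple in triples:
        row = []
        for term in triple:
            if term not in ids:
                ids[term] = len(values)
                values.append(term)
            row.append(ids[term])
        encoded.append(tuple(row))
    return encoded, values


triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
encoded, values = encode_triples(triples)
```

With the two example triples from the slide, `encoded` holds the id triples (0, 1, 2) and (0, 3, 4), and `values` plays the role of the ID/Value table.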
Storage and Indexing (2/7)
Triples Store and Dictionary
Store all triples in a clustered B+-tree
– Triples are sorted lexicographically
– This allows SPARQL patterns to be converted into range scans, e.g. ( literal1, literal2, ?x )

[Figure: clustered B+-tree with inner node "002 …" and leaf entries 000 001 002 003]

<Mapping Dictionary>
ID  Value
0   object214
1   hasColor
…   …

S  P  O
0  1  2
0  3  4
…  …  …
Actually, we don't need this table! (The triples are stored directly in the leaf pages of the B+-tree.)
Storage and Indexing (3/7)
Compressed Indexes
The range-scan approach relies on the variables forming a suffix of the pattern
– <S> - <P> - ?var or <S> - ?var1 - ?var2
To guarantee that every possible pattern, with variables in any position, can be answered by a single index scan, we maintain all six permutations of S, P, and O in six separate indexes
– (SPO, SOP, OSP, OPS, PSO, POS)
– We can afford this level of redundancy because the indexes are compressed
– Example: the pattern ?var - <P> - <O> is answered by the <POS> index
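A minimal Python sketch of this idea follows. It assumes id triples as plain tuples, uses sorted lists with `bisect` as a stand-in for the clustered B+-trees, and writes a variable as `None`. The names `build_indexes`, `choose_order`, and `scan` are hypothetical.

```python
from bisect import bisect_left

ORDERS = ("SPO", "SOP", "OSP", "OPS", "PSO", "POS")
COLUMN = {"S": 0, "P": 1, "O": 2}


def build_indexes(triples):
    """One sorted list per permutation (a stand-in for six clustered B+-trees)."""
    return {
        order: sorted(tuple(t[COLUMN[c]] for c in order) for t in triples)
        for order in ORDERS
    }


def choose_order(pattern):
    """Put bound positions first so that the variables (None) form a suffix."""
    bound = [c for c in "SPO" if pattern[COLUMN[c]] is not None]
    free = [c for c in "SPO" if pattern[COLUMN[c]] is None]
    return "".join(bound + free)


def scan(indexes, pattern):
    """Answer one triple pattern with a single range scan; returns (S, P, O) rows."""
    order = choose_order(pattern)
    prefix = tuple(pattern[COLUMN[c]] for c in order if pattern[COLUMN[c]] is not None)
    index = indexes[order]
    matches = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break  # left the contiguous range that shares the bound prefix
        matches.append(tuple(row[order.index(c)] for c in "SPO"))
    return matches


triples = [(0, 1, 2), (0, 3, 4), (5, 1, 2)]
indexes = build_indexes(triples)
```

For the pattern `(None, 1, 2)`, that is ?var - <P> - <O>, `choose_order` picks the POS index, and the scan touches only the rows that share the prefix (1, 2).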
Storage and Indexing (4/7)
Compressed Indexes
Instead of storing full triples, we store only the changes between triples
– The collation order causes neighboring triples to be very similar
We use a byte-level compression scheme
– The algorithm computes the delta to the previous triple
– If the delta is small, it is encoded directly in the header byte
– Otherwise, it writes a header byte with the size information, followed by the non-zero tail of the delta
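The scheme above can be sketched in Python as follows. This is a simplified illustration, not the engine's actual page format: the header layout (high bit as a flag, the three byte counts packed as n1*25 + n2*5 + n3), the 128 threshold, and the assumption that ids fit in 32 bits are choices made for this example, and the function names are hypothetical.

```python
def nbytes(x):
    """Bytes needed to store x (0 needs none); assumes 0 <= x < 2**32."""
    return (x.bit_length() + 7) // 8


def encode(d1, d2, d3, out):
    """Write a header byte with the three sizes, then the non-zero tail bytes."""
    sizes = [nbytes(d) for d in (d1, d2, d3)]
    out.append(0x80 | (sizes[0] * 25 + sizes[1] * 5 + sizes[2]))
    for d, n in zip((d1, d2, d3), sizes):
        out += d.to_bytes(n, "big")


def compress(triples):
    """Delta-encode sorted, distinct id triples into a byte string."""
    out, prev = bytearray(), (0, 0, 0)
    for v1, v2, v3 in triples:
        p1, p2, p3 = prev
        if (v1, v2) == (p1, p2):
            gap = v3 - p3
            if gap < 128:
                out.append(gap)              # small delta lives in the header byte
            else:
                encode(0, 0, gap - 128, out)
        elif v1 == p1:
            encode(0, v2 - p2, v3, out)
        else:
            encode(v1 - p1, v2, v3, out)
        prev = (v1, v2, v3)
    return bytes(out)


def decompress(data):
    """Inverse of compress."""
    triples, prev, i = [], (0, 0, 0), 0
    while i < len(data):
        header = data[i]
        i += 1
        if header < 0x80:
            cur = (prev[0], prev[1], prev[2] + header)
        else:
            packed = header & 0x7F
            vals = []
            for n in (packed // 25, (packed // 5) % 5, packed % 5):
                vals.append(int.from_bytes(data[i:i + n], "big"))
                i += n
            d1, d2, d3 = vals
            if d1:
                cur = (prev[0] + d1, d2, d3)
            elif d2:
                cur = (prev[0], prev[1] + d2, d3)
            else:
                cur = (prev[0], prev[1], prev[2] + d3 + 128)
        triples.append(cur)
        prev = cur
    return triples


sample = [(0, 1, 2), (0, 1, 3), (0, 1, 500), (0, 3, 4), (7, 1, 2)]
packed_sample = compress(sample)
```

Because neighboring triples in a sorted index usually share a prefix, most entries shrink to a header byte plus a few tail bytes, well below the 12 bytes of three uncompressed 32-bit ids.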
Storage and Indexing (5/7)
Compressed Indexes
Comparison of byte-wise compression vs. bit-wise compression for the Barton dataset
Each leaf page is compressed individually
– This allows us to seek to any leaf page and directly start reading triples
– The compressed index behaves just like a normal B+-tree
Storage and Indexing (6/7)
Aggregated Indexes
For many SPARQL patterns, indexing partial triples rather than full triples would be sufficient
Aggregated indexes
– Each two-value aggregated index stores only two of the three columns of a triple, plus a count: (value1, value2, count). This is done for (SP, PS, SO, OS, PO, OP)
– Each of the three one-value indexes stores (value1, count). This is done for (S, P, O)

select ?a ?c
where { ?a ?b ?c }
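As a rough illustration, an aggregated index can be built by counting how often each projected key occurs. The sketch below assumes id triples as tuples; the function name `aggregated_index` is hypothetical.

```python
from collections import Counter

COLUMN = {"S": 0, "P": 1, "O": 2}


def aggregated_index(triples, order):
    """Sorted (value1[, value2], count) entries for an order such as 'SO' or 'P'."""
    counts = Counter(tuple(t[COLUMN[c]] for c in order) for t in triples)
    return sorted(key + (n,) for key, n in counts.items())


triples = [(0, 1, 2), (0, 3, 2), (0, 3, 4)]
so_index = aggregated_index(triples, "SO")
p_index = aggregated_index(triples, "P")
```

For the query above, which never uses ?b, the SO entries are enough: each (S, O) pair is read once, and the count says how many duplicates it stands for.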
Storage and Indexing (7/7)
[Figure: index overview]
Triple Index: SPO, SOP, PSO, POS, OSP, OPS
Aggregate Index (each entry with a count): SP, SO, PS, PO, OP, OS, S, P, O
Based on KS Kim's presentation slide
Outline
Introduction
Background and State of the Art
Storage and Indexing
Query Processing and Optimization
– Translating SPARQL Queries
– Optimizing Join Ordering
Selectivity Estimates
Evaluation
Conclusion
Query Processing and Optimization (1/2)
Translating SPARQL Queries
Each query is parsed and expanded into a set of triple patterns
The parser performs dictionary lookups, so that the literals are mapped to ids
When a query consists of
– a single pattern: use the index structures and answer the query with a single range scan
– multiple triple patterns: join the results of the individual patterns
When a query includes the distinct option, we eliminate duplicates in the result
Finally, a dictionary lookup operator converts the resulting ids back into strings
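Joining the results of two patterns can be sketched as a merge join, since index scans deliver their results in sorted order. The example assumes each input is a list of (join key, payload) pairs already sorted by key; the function name `merge_join` and the sample data are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for _, payload in right[j:j_end]:
                    out.append((lk, left[i][1], payload))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left, right)
```

Both inputs are read once from front to back, which is why sorted index scans and merge joins fit together well.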
Query Processing and Optimization (2/2)
Optimizing Join Ordering
Required properties
– Bushy join trees (rather than left-deep or right-deep trees)
– Fast plan enumeration and cost estimation
– Extensive use of merge joins
DP framework
– To find the best plan, consider all possible plans for subsets of the triple patterns
– Recursively compute the costs of joining subsets to find the cost of each plan
– Once the plan for a subset has been computed, store it and reuse it
– Larger plans are created by joining optimal solutions of smaller problems
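The DP framework can be illustrated with a small Python sketch. It is not the optimizer's actual cost model: the cost here is simply the sum of estimated intermediate result sizes, cardinalities and join selectivities are supplied as toy inputs, cross products are skipped, and the join graph is assumed to be connected. The function name `best_plan` is hypothetical.

```python
from itertools import combinations


def best_plan(cards, sel):
    """Return (cost, plan) for joining all patterns.

    cards[i] is the estimated cardinality of pattern i; sel[(i, j)] with i < j
    is the estimated selectivity of the join between patterns i and j.
    Both inputs and the cost model are toy assumptions for illustration.
    """
    n = len(cards)

    def card(subset):
        c = 1.0
        for i in subset:
            c *= cards[i]
        for pair in combinations(sorted(subset), 2):
            c *= sel.get(pair, 1.0)
        return c

    # best[subset] = (cost, plan); a single pattern needs no join.
    best = {frozenset([i]): (0.0, i) for i in range(n)}
    for size in range(2, n + 1):
        for members in combinations(range(n), size):
            subset = frozenset(members)
            for k in range(1, size):
                for left_members in combinations(members, k):
                    left = frozenset(left_members)
                    right = subset - left
                    if left not in best or right not in best:
                        continue  # one side has no cross-product-free plan
                    connected = any(
                        (min(i, j), max(i, j)) in sel for i in left for j in right
                    )
                    if not connected:
                        continue  # skip cross products
                    cost = best[left][0] + best[right][0] + card(subset)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(range(n))]


cost, plan = best_plan([1000, 10, 100], {(0, 1): 0.01, (1, 2): 0.001})
```

Every split of a subset into two non-empty halves is considered, so bushy trees are covered, and each subset's best plan is stored once and reused by all larger subsets.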
Outline
Introduction
Background and State of the Art
Storage and Indexing
Query Processing and Optimization
Selectivity Estimates
– Selectivity Histograms
Evaluation
Conclusion
Selectivity Estimates
Estimated cardinalities and selectivities have a huge impact on plan generation
Selectivity Histograms
– The cardinality of a single triple pattern, using aggregated indexes
– The number of join partners
Frequent join paths
Outline
Introduction
Background and State of the Art
Storage and Indexing
Query Processing and Optimization
Selectivity Estimates
Evaluation
– General Setup
– Query Run-times
Conclusion
Evaluation (1/3)
General Setup
Setup
– 2GHz dual core, 2GB RAM, 30MB/s disk, Linux
Competitors
– MonetDB: column-store-based approach, presented in VLDB07 by Abadi et al.
– PostgreSQL: triple store with SPO, POS, PSO indexes, similar to Sesame
– Other approaches performed much worse: Jena2, Yars2 (DERI)
Evaluation (2/3)
General Setup
Datasets
– Barton: library data, 51M triples (4.1GB)
– Yago: Wikipedia-based ontology, 40M triples (3.1GB)
– LibraryThing (partial crawl): tags that users have assigned to the books, 30M triples (1.8GB)
DB load time & DB size
Evaluation (3/3)
Query Run-times
Average run-times for cold caches (sec)
            Barton  Yago   LibraryThing
RDF-3X      5.9     0.7    0.89
MonetDB     26.4    78.2   8.16
PostgreSQL  167.8   10.6   93.90

Average run-times for warm caches (sec)
            Barton  Yago   LibraryThing
RDF-3X      0.4     0.04   0.13
MonetDB     4.8     54.60  4.39
PostgreSQL  64.3    0.56   30.40
Outline
Introduction
Background and State of the Art
Storage and Indexing
Query Processing and Optimization
Selectivity Estimates
Evaluation
Conclusion
Conclusion
RDF-3X is a fast and flexible RDF/SPARQL engine
– Exhaustive but very space-efficient triple indexes
– Generic storage that avoids physical design tuning
– Fast runtime system; query optimization has a huge impact