Searching Linked Data
Transcript of Searching Linked Data
Thanh Tran, AIFB Institute, KIT, [email protected] KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
Searching Linked Data
From Finding Relevant Sources to Computing Answers
Invited Presentation @ International Workshop on Scalable Semantic Computing, Hangzhou, China, November 2010.
Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao Ma, Andreas Wagner, Rudi Studer from AIFB Institute, KIT
Agenda
Searching Linked Data
Opportunities & challenges
Keyword Query Routing
Problem Definition
Summary Models
Experiments
Linked Data Query Processing
Combining Top-down & Bottom-up
Stream-based Query Processing
Corrective Source Ranking
Conclusions
Linked Data
- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
- As of 09-2010 + other linked data not covered by LOD cloud
More Data
More Links
Opportunities
"Articles from awarded researchers at Stanford"
- Freebase contains data about people
- DBpedia contains information about awards
- DBLP contains bibliographic data
More Data
More Links
- More complex information needs
- More precise results
- More integrated results
Problems
"Articles from awarded researchers at Stanford"
$\exists x, y, z.\ \mathit{prizes}(x, \text{Turing Award}) \land \mathit{worksAt}(x, y) \land \mathit{name}(y, \text{Stanford}) \land \mathit{publication}(x, z)$
Formulating queries is a hard task! (USABILITY)
- Which data sources?
- Which schema elements?
Processing queries is expensive! (SCALABILITY)
- Process against all data sources?
- Explore all links to other sources?
Large number of unknown, unexplored & irrelevant sources!
- What is in there?
- What is out there?
- What is relevant?
Searching Linked Data
Given the needs (expressed as sets of keywords):
- Are there answers in linked data?
- What combinations of data sources produce them?
- How to incorporate related, unexplored linked sources?
Keyword Query Routing to Relevant Linked Data Sources
- Identify valid combinations of sources
- Identify schema elements
- Let the user choose a combination of sources
Focused, Adaptive and Stream-based Linked Data Query Processing (cf. LarKC)
- Focus on this combination of sources and explore related linked sources
LOD Data Graph
[Figure: data graph spanning three sources (DBLP, Freebase, DBPedia). Nodes include per1 to per4, uni1 ("Stanford University"), pub1 to pub3, prize1 ("Turing Award") and prize2 ("Music Award"); edge labels include name, label, title, author, employ, prizes, and sameAs.]
- Web data is modeled as a set of interlinked data graphs
- Each data graph represents a source
- Data graph vs. schema graph vs. source graph
LOD Schema Graph
[Figure: schema graph for the sources DBLP, Freebase, and DBPedia. Classes include Author, University, Person, Prize, Written Work, and Article; edge labels include author, employ, prizes, and sameAs.]
LOD Source Graph
[Figure: source graph with the sources DBLP, Freebase, and DBPedia as nodes; edge labels include sameAs and author.]
Keyword Query Answers
), dD,Q,F,R(q ji
User information need: "stanford article award"
[Figure: the LOD data graph (DBLP, Freebase, DBPedia) from the earlier slide, with an additional type edge to Article.]
Problem Definition
- A keyword query result (also called a Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called a keyword element), and in which these elements are pairwise connected by a path.
- A d-max Steiner graph is a Steiner graph in which every path between keyword elements has length d-max or less.
- Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if the union of its sources produces non-empty keyword query results.
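The validity condition can be made concrete with a small sketch. It assumes a toy, undirected data graph whose edges and source assignments are invented for illustration (only the node names follow the slides), and it checks one fixed choice of keyword elements rather than searching over all matches. `is_valid_plan`, `EDGES`, and `SOURCE_OF` are hypothetical names.

```python
from collections import deque
from itertools import combinations

# Toy data graph: edges and source assignment are assumptions for illustration.
EDGES = [
    ("uni1", "per1"), ("per1", "per2"), ("per2", "pub1"),
    ("per2", "per3"), ("per3", "prize1"),
]
SOURCE_OF = {
    "uni1": "Freebase", "per1": "Freebase",
    "per2": "DBLP", "pub1": "DBLP",
    "per3": "DBPedia", "prize1": "DBPedia",
}


def build_adjacency(edges, allowed_sources):
    """Keep only edges whose endpoints both belong to the chosen sources."""
    adj = {}
    for a, b in edges:
        if SOURCE_OF[a] in allowed_sources and SOURCE_OF[b] in allowed_sources:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def distance(adj, start, goal):
    """Breadth-first hop count between two nodes, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def is_valid_plan(sources, keyword_elements, d_max):
    """True if the keyword elements are pairwise connected within d_max hops
    using only data from the given sources."""
    adj = build_adjacency(EDGES, sources)
    for a, b in combinations(keyword_elements, 2):
        d = distance(adj, a, b)
        if d is None or d > d_max:
            return False
    return True


# One keyword element per keyword: stanford -> uni1, article -> pub1, award -> prize1.
ELEMENTS = ["uni1", "pub1", "prize1"]
ALL_SOURCES = {"Freebase", "DBLP", "DBPedia"}
```

In this toy graph the plan with all three sources is valid for d-max = 4 but not for d-max = 3, and dropping DBLP disconnects the keyword elements.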
A Valid Keyword Routing Plan
), dD,Q,F,R(q ji
User information need: "stanford article award"
[Figure: the LOD data graph (DBLP, Freebase, DBPedia) from the earlier slide, with an additional type edge to Article.]
Keyword Sets
[Figure: the LOD data graph (DBLP, Freebase, DBPedia) with a keyword set drawn for each source; keywords shown include Stanford, University, John, McCarthy, Turing, Award, Smith, and Music.]
- One keyword set for every data source
- Elements stand for distinct keywords mentioned in a source
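A minimal sketch of the keyword-set (KS) summary, assuming labels are tokenized on word boundaries and lower-cased. The assignment of labels to sources below is an assumption made for the example; `keyword_sets` and `covers` are hypothetical helper names.

```python
import re

# Labels per source; which label lives in which source is assumed here.
LABELS = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP": ["John McCarthy"],
    "DBPedia": ["John McCarthy", "Turing Award", "John Smith", "Music Award"],
}


def keyword_sets(labels_by_source):
    """One set of distinct, lower-cased keywords per data source."""
    return {
        source: {tok.lower() for text in texts for tok in re.findall(r"\w+", text)}
        for source, texts in labels_by_source.items()
    }


def covers(ks, sources, query):
    """True if every query keyword is mentioned in at least one chosen source."""
    return all(any(kw.lower() in ks[s] for s in sources) for kw in query)


KS = keyword_sets(LABELS)
```

Keyword sets only record which keywords a source mentions, not whether the matching elements are connected, so a covering combination of sources is not necessarily a valid routing plan.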
Element-level Keyword-Element Relationship Graph (E-KERG)
[Figure: the LOD data graph with keyword sets and, below it, the keyword-element relationship graph over data elements such as uni1, per1, per2, per3, per4, prize1, prize2, and pub4, paired with keywords such as John and Award.]
- A keyword-element captures a keyword k and the data element mentioning k
- A relationship between two keyword-elements exists iff there is a path between their associated data elements
- In a d-max KERG, the paths considered have length d-max or less
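The E-KERG definition can be sketched as follows, assuming an undirected toy graph and a bounded breadth-first search. The edges and keyword mentions are invented for illustration, and `build_ekerg` is a hypothetical name, not the authors' implementation.

```python
from collections import deque
from itertools import combinations

# Toy data: edges and keyword mentions are assumptions for illustration.
EDGES = [("uni1", "per1"), ("per1", "per2"), ("per2", "per3"), ("per3", "prize1")]
MENTIONS = {
    "uni1": ["stanford", "university"],
    "per1": ["john", "mccarthy"],
    "prize1": ["turing", "award"],
}


def reachable_within(adj, start, d_max):
    """Map each node reachable from start in at most d_max hops to its hop count."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the d-max bound
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def build_ekerg(edges, mentions, d_max):
    """Return (keyword-elements, relationships) of a d-max E-KERG."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # A keyword-element is a (keyword, data element) pair.
    nodes = [(kw, elem) for elem, kws in mentions.items() for kw in kws]
    reach = {elem: reachable_within(adj, elem, d_max) for elem in mentions}
    rels = set()
    for a, b in combinations(nodes, 2):
        # Related iff their data elements are at most d_max hops apart
        # (keywords on the same element are trivially related).
        if b[1] in reach[a[1]]:
            rels.add(frozenset((a, b)))
    return nodes, rels
```

With d-max = 2 only nearby keyword-elements are related; raising d-max to 4 also links the keywords on uni1 with those on prize1, which is why larger d-max values give a larger summary.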
Schema-level Keyword-Element Relationship Graph (S-KERG)
[Figure: the LOD data graph with keyword sets and the keyword-element relationship graph, grouped by classes such as University, Person, Author, Article, and Prize.]
- A keyword-element captures a keyword k and a schema element that has some instances (data elements) mentioning k
- A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
- Groups elements (relationships) when they capture the same keyword (hold between the same classes)
Data-Source-level Keyword-Element Relationship Graph (D-KERG)
[Figure: the LOD data graph with keyword sets and the keyword-element relationship graph, summarized at the level of data sources.]
- A keyword-element captures a keyword k and a source that contains some instances (data elements) mentioning k
- A relationship between two keyword-elements exists if there is a path between some instances of their associated sources
- Groups elements (relationships) when they capture the same keyword (hold between the same sources)
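Both S-KERG and D-KERG can be read as a grouping of element-level keyword-elements. The sketch below assumes that reading: each (keyword, data element) pair is mapped to (keyword, class) or (keyword, source) and duplicates are merged. The relationships and the class and source assignments are invented for illustration; `summarize` is a hypothetical name, and dropping relationships that collapse onto a single group is a simplification of this sketch.

```python
# Element-level relationships between (keyword, data element) pairs (assumed).
E_RELS = [
    (("john", "per3"), ("award", "prize1")),
    (("john", "per4"), ("award", "prize2")),
    (("stanford", "uni1"), ("john", "per1")),
]
CLASS_OF = {"per1": "Person", "per3": "Person", "per4": "Person",
            "uni1": "University", "prize1": "Prize", "prize2": "Prize"}
SOURCE_OF = {"per1": "Freebase", "uni1": "Freebase", "per3": "DBPedia",
             "per4": "DBPedia", "prize1": "DBPedia", "prize2": "DBPedia"}


def summarize(relationships, group_of):
    """Lift (keyword, element) pairs to (keyword, group) and merge duplicates."""
    nodes, rels = set(), set()
    for (kw_a, elem_a), (kw_b, elem_b) in relationships:
        ga = (kw_a, group_of[elem_a])
        gb = (kw_b, group_of[elem_b])
        nodes.update((ga, gb))
        if ga != gb:  # simplification: ignore relationships inside one group
            rels.add(frozenset((ga, gb)))
    return nodes, rels


S_NODES, S_RELS = summarize(E_RELS, CLASS_OF)   # schema level
D_NODES, D_RELS = summarize(E_RELS, SOURCE_OF)  # data-source level
```

The two element-level relationships between "john" and "award" merge into one at both levels, which is where the reduction in summary size comes from.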
Experiments
- Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings
- 30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources
- Example keywords: Town River America Beijing Conference Database 2007
Validity
- P@k measures the percentage of plans that are valid among the top-k plans
- P@5 is only 6% for KS and up to 100% for E-KERG (dmax = 4)
- More valid plans were computed when a higher value was used for dmax
- dmax = 3 seems to be a good trade-off
- Queries with a larger number of keywords resulted in lower precision
[Figure: two plots of P@5 (0.0 to 1.0) for E-KERG, D-KERG, S-KERG, and KS, one against the number of keywords |K| (2 to 5) and one against dmax (0 to 4).]
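For reference, P@k as described above can be computed as in this sketch. The ranked plans and the set of valid plans are made up for the example, and dividing by k even when fewer than k plans are returned is an assumption (the usual convention), not something the slides state.

```python
def precision_at_k(ranked_plans, is_valid, k=5):
    """Fraction of the top-k ranked plans that are valid."""
    top = ranked_plans[:k]
    return sum(1 for plan in top if is_valid(plan)) / k


# Hypothetical ranking of routing plans (each plan is a set of sources).
RANKED = [
    frozenset({"Freebase", "DBLP", "DBPedia"}),
    frozenset({"Freebase", "DBPedia"}),
    frozenset({"DBLP", "DBPedia"}),
    frozenset({"Freebase", "DBLP"}),
    frozenset({"DBPedia"}),
]
# Hypothetical ground truth: which plans actually produce results.
VALID = {RANKED[0], RANKED[2], RANKED[3]}

P_AT_5 = precision_at_k(RANKED, lambda plan: plan in VALID, k=5)
```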
Performance
- Times increased with higher values for dmax: sharply for E-KERG and S-KERG, relatively stable for D-KERG
- Times increased with the number of keywords
- All models except D-KERG performed poorly on complex queries
- E-KERG needed more than 100s for queries with more than 2 keywords
- Time for D-KERG was no more than 10ms on average
[Figure: two log-scale plots of query processing time in ms (1 to 1,000,000) for S-KERG, D-KERG, KS, and E-KERG, one against dmax (0 to 4) and one against |K| (2 to 5).]
Mixed Query Processing Strategy
Combination of top-down and bottom-up strategies
- Top-down: partial local index of sources, not assumed to be complete
- Bottom-up: new sources are discovered at run-time
Corrective Source Ranking
- Deal with heterogeneous source descriptions
- Adaptive re-ranking
Stream-based Query Processing
- Deal with the unpredictable nature of Linked Data access
ISWC 2010, Shanghai, China
Stream-based Query Processing
Compile-time
- Construct query plan
- Probe local index for sources
Run-time
- Retrieve sources
- Push data into query plan
- Discover new sources
- Rank sources
Network latency
- Do not block!
- Evaluation driven by incoming data
[Figure: a query plan with two joins over the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y), and name(?y, ?n), producing results. Further elements: source retrievers (Source Retriever 1, Source Retriever 2, ...) that push data into the plan; a source ranker with a ranked list (Source 1, score 1.0; Source 2, score 0.7; ...); samples; the local source index; Linked Data. Arrow labels: "Push", "Retrieve source", "Source discovered".]
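One standard way to get a non-blocking, data-driven join is a symmetric hash join, sketched below for the three triple patterns on the slide. The slides do not say which join operator is used, so this is a stand-in technique chosen for illustration; the class name, the `push` interface, and the bindings (alice, bob) are hypothetical.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Push-based join: every arriving binding is stored on its own side and
    immediately probed against the other side, so neither input blocks."""

    def __init__(self, join_var, emit):
        self.join_var = join_var
        self.emit = emit  # callback that receives each joined binding
        self.tables = (defaultdict(list), defaultdict(list))

    def push(self, side, binding):
        key = binding[self.join_var]
        self.tables[side][key].append(binding)
        for other in self.tables[1 - side].get(key, ()):
            self.emit({**other, **binding})


# Plan: (worksAt(?x, dbpedia:KIT) JOIN knows(?x, ?y)) JOIN name(?y, ?n)
results = []
join_y = SymmetricHashJoin("y", results.append)
join_x = SymmetricHashJoin("x", lambda b: join_y.push(0, b))

# Bindings arrive in whatever order the sources happen to respond.
join_y.push(1, {"y": "bob", "n": "Bob"})      # from name(?y, ?n)
join_x.push(1, {"x": "alice", "y": "bob"})    # from knows(?x, ?y)
join_x.push(0, {"x": "alice"})                # from worksAt(?x, dbpedia:KIT)
```

A result is emitted as soon as the last matching binding arrives, regardless of which source was slow.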
Corrective Source Ranking
- Prefer more relevant sources
- Relevancy of a source is based on the current query, any available intermediate results, and the overall optimization goal
- Define a set of source features and derive concrete source metrics
- Not all metrics are available for all sources (heterogeneity)
- Refine previously computed metrics using newly discovered information (intermediate results, samples)
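A rough sketch of corrective re-ranking under heterogeneous source descriptions. It assumes a source score is a weighted average over whichever metrics are currently known for that source. The metric names (`index_match`, `result_links`), the weights, and the `SourceRanker` class are hypothetical; the slides do not specify the actual features or scoring function.

```python
def score(metrics, weights):
    """Weighted average over the metrics that are known for a source."""
    known = [(weights[name], value) for name, value in metrics.items()
             if name in weights and value is not None]
    total = sum(w for w, _ in known)
    if total == 0:
        return 0.0
    return sum(w * v for w, v in known) / total


class SourceRanker:
    def __init__(self, weights):
        self.weights = weights
        self.metrics = {}  # source -> {metric name: value}

    def update(self, source, **new_metrics):
        """Add or refine metrics as new information is discovered."""
        self.metrics.setdefault(source, {}).update(new_metrics)

    def ranking(self):
        """Pending sources, best score first."""
        return sorted(self.metrics,
                      key=lambda s: score(self.metrics[s], self.weights),
                      reverse=True)

    def next_source(self):
        """Pop the currently best source for retrieval, or None if empty."""
        order = self.ranking()
        if not order:
            return None
        del self.metrics[order[0]]
        return order[0]


ranker = SourceRanker({"index_match": 1.0, "result_links": 2.0})
ranker.update("src1", index_match=0.7)   # known from the local index
ranker.update("src2", index_match=0.4)
before = ranker.ranking()
ranker.update("src2", result_links=1.0)  # learned from intermediate results
after = ranker.ranking()
```

Here src2 overtakes src1 once a metric derived from intermediate results becomes available, which is the corrective behaviour the slide describes.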
Evaluation
- Three systems: top-down (TD), bottom-up (BU), mixed (MI)
- 8 queries over various datasets (DBpedia, Geonames, NYT)
- To make the approaches comparable, sources were restricted to those discoverable by the BU approach: ~6200 sources, containing ~500k triples
- Sources hosted on a local proxy server with an artificial delay of 2 seconds
- 25% of sources were randomly chosen to construct the index for MI
Results
Overall early result reporting
- 25% results: MI 8.7s, BU 15.1s
- 50% results: MI 12.8s, BU 22.0s
- Improvement of ~42%
Detailed results for two queries:
                  Query 1                          Query 6
                  BU        MI        TD           BU        MI        TD
25% Results       24810.5   10300.0   11038.0      8222.5    4743.5    5545.0
50% Results       43464.5   40782.0   15787.0      10961.5   7650.5    5634.0
Total             84066.5   86895.5   44323.5      24086.0   20711.0   16469.0
Src. Selection    0.0       853.0     1444.5       0.0       1331.0    1863.5
Ranking           25.5      2404.0    411.5        23.5      292.5     335.0
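The ~42% figure is consistent with reading "improvement" as the relative reduction in time of MI over BU, as this small check shows. That reading is an assumption; the slides do not spell out the formula.

```python
def improvement(baseline_s, improved_s):
    """Relative reduction in time with respect to the baseline."""
    return (baseline_s - improved_s) / baseline_s


at_25 = improvement(15.1, 8.7)    # time until 25% of results, BU vs. MI
at_50 = improvement(22.0, 12.8)   # time until 50% of results, BU vs. MI
average = (at_25 + at_50) / 2
```

Both values, and their average, round to 42%.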
Result Arrival Times
Conclusions
Keyword query routing
- Helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs
- Focus on relevant combinations
- Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid
Stream-based query processing helps to deal with the unpredictable nature of Linked Data
A corrective, mixed strategy that incorporates new sources and knowledge at run-time for optimization (source ranking) helped to report early results 42% faster on average
Thanks for Your Attention!
Institute AIFB, KIT