Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab - University of Fribourg - Switzerland
{firstname.lastname}@unifr.ch
SIGIR 2012 - Monday, August 13th 2012

Motivation
• Many search engine queries are about entities.
• An increasingly large amount of entity data is available online.
• It is often represented as huge graphs (e.g., the LOD cloud, Google Knowledge Graph, Facebook social graph).
• Entities have globally unique identifiers (e.g., URIs), which are hard to discover and/or memorize.

Ad-hoc Object Retrieval (informal definition)
• “Given the description of an entity, give me back its identifier”
• Description can be keywords (e.g., “Harry Potter”).
• More than one identifier per entity (e.g., dbpedia + freebase).
• How to evaluate returned results?

Ad-hoc Object Retrieval (formal definition by Pound et al.)
• Input: unstructured query q and data graph G.
• Output: ranked list of resource identifiers (URIs) from G.
• Evaluation: results (URIs) scored by a judge with access to all the information contained in or linked to the resource.
• Standard collections exist.

http://ex.plode.us/tag/harry+potter
http://www.vox.com/explore/interests/harry%20potter
http://www.flickr.com/groups/harrypotterandthedeathlyhallows/
http://harrypotter.wizards.pro/

http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows
http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk
http://harrypotter.wizards.pro/
http://ebiquity.umbc.edu/person/html/Harry/Chen/
http://dbpedia.org/resource/Ceramist

Overview of Our Solution
• Inverted indices on the LOD Cloud...
• ...and an RDF store containing the data.
• Simple NLP techniques, autocompletion, pseudo-relevance feedback, BM25, BM25F.

A Simple Example
• Query: SIGIR
• Query processing: NLP techniques, query auto-completion, pseudo-relevance feedback.
• II + ranking function(s):
1. http://dbpedia.org/…/SIGIR
2. http://dbpedia.org/…/IRAQ
3. …
• Graph traversals + final ranking function:
1. http://dbpedia.org/…/SIGIR
2. http://freebase.com/…/sigir
3. http://dbpedia.org/…/IRAQ
…
• Open questions: How to build the II? Which properties should we follow? How to rank new results?

Outline
1. Inverted Indices
2. Graph-Based Entity Search
  2.1. Object Properties vs Datatype Properties
  2.2. Properties to Follow
3. Experimental Results
  3.1. Experimental Setting
  3.2. IR Techniques: Experimental Results
  3.3. Evaluation of the Hybrid Approaches
  3.4. Overhead of the Graph Traversal

1. Inverted Indices (IIs)
• Simple inverted index:
  • index all literals attached to each node in the input graph.
  • “movie” → http://…types/film
• Structured inverted index with three fields:
  • URI - tokenized URIs identifying entities.
  • Label - manually selected datatype properties that point to textual descriptions of the entity (e.g., label, title, name, full-name, …).
  • Attributes - all other literals.
• IR techniques used: BM25(F), query auto-completion, query extension, relevance feedback.
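
The three-field index and BM25F ranking described on this slide can be sketched roughly as follows. This is a minimal illustration, not the system presented in the talk: the class name, the tokenizer, the field weights, the parameters k1 and b, and the example.org URIs are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

FIELDS = ("uri", "label", "attributes")


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (this also splits URIs)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class StructuredIndex:
    """Toy inverted index with one posting list per (field, term)."""

    def __init__(self):
        # postings[field][term][entity] = term frequency in that field
        self.postings = {f: defaultdict(Counter) for f in FIELDS}
        self.lengths = {f: Counter() for f in FIELDS}  # tokens per entity and field
        self.entities = set()

    def add(self, uri, label="", attributes=""):
        self.entities.add(uri)
        for field, text in zip(FIELDS, (uri, label, attributes)):
            tokens = tokenize(text)
            self.lengths[field][uri] = len(tokens)
            for tok in tokens:
                self.postings[field][tok][uri] += 1

    def bm25f(self, query, weights, k1=1.2, b=0.75):
        """Rank entities with a simple BM25F variant (weights are hypothetical)."""
        n = len(self.entities)
        if n == 0:
            return []
        avg_len = {f: (sum(self.lengths[f].values()) / n) or 1.0 for f in FIELDS}
        scores = Counter()
        for term in tokenize(query):
            matching = set()
            for f in FIELDS:
                matching.update(self.postings[f].get(term, {}))
            if not matching:
                continue
            idf = math.log(1 + (n - len(matching) + 0.5) / (len(matching) + 0.5))
            for entity in matching:
                # Length-normalized, field-weighted pseudo term frequency.
                tf = 0.0
                for f in FIELDS:
                    raw = self.postings[f].get(term, {}).get(entity, 0)
                    norm = 1 - b + b * self.lengths[f][entity] / avg_len[f]
                    tf += weights[f] * raw / norm
                scores[entity] += idf * tf / (k1 + tf)
        return scores.most_common()


# Usage with made-up entities and weights.
index = StructuredIndex()
index.add("http://example.org/resource/SIGIR", label="SIGIR",
          attributes="Special Interest Group on Information Retrieval")
index.add("http://example.org/resource/IRAQ", label="Iraq",
          attributes="country in western Asia")
index.add("http://example.org/types/film", label="film", attributes="movie")
print(index.bm25f("sigir", {"uri": 1.0, "label": 2.0, "attributes": 0.5}))
```

Keeping the fields separate lets a match in the label count more than a match buried in the attributes, which is the point of using BM25F rather than plain BM25 over a single concatenated field.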

2. Graph-Based Entity Search
[Figure: the top-N IR results are expanded by following properties p1, p2, ..., p_m to new URIs; the new URIs are filtered with the test sim(e, q) > τ, assigned scores (e.g., 0.284, 1.428, 0.556), and merged into the re-ranked results.]
• Take the top-N documents.
• Follow links/properties and get new URIs.
• Filter new results by text similarity with respect to the user query.
• Scoring functions: count sim > τ, avg sim > τ, sum sim, avg sim, sum BM25 - ε.
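
The expand, filter, and re-rank loop above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `neighbors` dictionary (standing in for the RDF store lookup), and the `sim` callable are hypothetical, and the reading of "sum BM25 - ε" used here (a new entity receives the summed BM25 scores of the entry points that reach it, minus a small ε) is an interpretation of the slide, not a confirmed definition.

```python
def graph_rerank(ir_results, neighbors, sim, n=3, tau=0.0, eps=1e-3):
    """Merge IR results with entities reached by graph traversal.

    ir_results: list of (uri, bm25_score), sorted by decreasing score.
    neighbors:  dict mapping a uri to the uris reached by following the
                selected properties (stand-in for a query to the RDF store).
    sim:        callable returning the text similarity of a uri to the query.
    """
    scores = dict(ir_results)
    reached = {}
    # Take the top-N entry points and follow their links.
    for uri, bm25 in ir_results[:n]:
        for new_uri in neighbors.get(uri, ()):
            # Keep only new results that are textually similar to the query.
            if sim(new_uri) > tau:
                reached[new_uri] = reached.get(new_uri, 0.0) + bm25
    # Assumed scoring: sum of entry-point BM25 scores minus epsilon; an
    # entity already found by the index keeps the better of its two scores.
    for new_uri, total in reached.items():
        scores[new_uri] = max(scores.get(new_uri, float("-inf")), total - eps)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Usage with made-up identifiers and similarities.
ir = [("A", 2.0), ("B", 1.5), ("C", 1.0), ("D", 0.5)]
links = {"A": ["X", "Y"], "B": ["X"], "D": ["Z"]}
similarity = {"X": 0.9, "Y": 0.0, "Z": 0.8}
for uri, score in graph_rerank(ir, links, lambda u: similarity.get(u, 0.0)):
    print(uri, round(score, 3))
```

In this toy run "Z" never appears because its entry point "D" is outside the top-N, and "Y" is dropped because its similarity does not exceed τ.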

2.1. Object Properties vs Datatype Properties
• Object properties:
  • connect different entities
  • explore the whole graph
• Datatype properties:
  • give additional info about entities
  • explore just the neighborhood of a node

2.2. Properties to Follow
• RDF graph queried with SPARQL queries.
• Scope 1 queries vs Scope 2 queries.
• Set of predicates to follow selected using:
  • Common sense (e.g., sameAs)
  • Statistics from the data
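
As a rough illustration of how such traversal queries might be generated, the sketch below builds SPARQL strings for a given entry point and predicate set. The slide does not define the two scopes; the code assumes that scope 1 follows a single property hop from the entry point and scope 2 follows two hops. The function name and the example.org URI are hypothetical; only the owl:sameAs IRI is a real vocabulary term.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_scope_query(entry_uri, predicates, scope=1):
    """Return a SPARQL SELECT query that collects URIs reachable from entry_uri.

    Assumption: scope 1 = one hop, scope 2 = two hops, both restricted
    to the given predicates.
    """
    if scope not in (1, 2):
        raise ValueError("scope must be 1 or 2")
    allowed = " ".join(f"<{p}>" for p in predicates)
    if scope == 1:
        pattern = f"VALUES ?p1 {{ {allowed} }} <{entry_uri}> ?p1 ?new ."
    else:
        pattern = (
            f"VALUES ?p1 {{ {allowed} }} VALUES ?p2 {{ {allowed} }} "
            f"<{entry_uri}> ?p1 ?mid . ?mid ?p2 ?new ."
        )
    # Keep only resources (not literals) as candidate new results.
    return f"SELECT DISTINCT ?new WHERE {{ {pattern} FILTER(isIRI(?new)) }}"


# Usage: the entry point would come from the inverted index.
print(build_scope_query("http://example.org/resource/SIGIR", [OWL_SAME_AS], scope=1))
print(build_scope_query("http://example.org/resource/SIGIR", [OWL_SAME_AS], scope=2))
```

The resulting strings would then be sent to whatever SPARQL endpoint fronts the RDF store; that step is left out because the slides do not say which store or client is used.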

Properties to Follow: Two Examples
[Figure: two example graph traversals; the entry point is given by the II.]

3. Experimental Results

3.1. Experimental Setting
• SemSearch 2010 and 2011 test sets:
  • Billion Triple Challenge 2009 (BTC2009) dataset.
  • 1.3 billion RDF triples crawled from the LOD cloud.
  • 92 and 50 queries, respectively.
• Evaluation of systems with depth-10 pooling by means of crowdsourcing.
• Measures taken into consideration: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), early Precision (P10).
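
For readers unfamiliar with the three measures, the following sketch shows one common way to compute them from a ranked list of URIs and a set of relevance judgments. It is a generic illustration, not the evaluation script used for SemSearch: the function names are made up, and NDCG here uses linear gains with a log2 discount, which is one of several conventions.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant (P10 when k=10)."""
    return sum(1 for uri in ranked[:k] if uri in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, uri in enumerate(ranked, start=1):
        if uri in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, gains, k=10):
    """gains: dict uri -> graded relevance (assumed linear gain)."""
    dcg = sum(gains.get(uri, 0) / math.log2(i + 1)
              for i, uri in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Usage with a made-up ranking and made-up judgments.
ranking = ["a", "b", "c", "d"]
print(precision_at_k(ranking, {"a", "c"}, k=2))
print(average_precision(ranking, {"a", "c"}))
print(ndcg_at_k(ranking, {"a": 2, "c": 1}, k=4))
```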

Completing Relevance Judgements by Crowdsourcing
• We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.
• To be fair, we used the same design and settings that were used for the AOR task of SemSearch.

3.2. IR Techniques: Experimental Results
[Figure callout: "Our baseline."]

3.3. Evaluation of Hybrid Approaches
N = 3, τ = 0, score = sum BM25 - ε

3.4. Overhead of the Graph Traversal
• Time in milliseconds needed for each part of the hybrid approaches.
• Measurements taken on a single machine with a cold cache.
• Surprisingly small overhead (17% for the best results).

Conclusions
• AOR = “Given the description of an entity, give me back its identifier”
• Simple IR techniques alone give disappointing results on the AOR task.
• Hybrid system for AOR:
  • combines classic IR techniques with a structured database storing graph data.
• Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline).
• For the best working configuration found, the overhead caused by the graph traversal part is limited (17% more than running the chosen baseline).

Thank you for your attention
• You can find the new relevance judgments at http://diuf.unifr.ch/xi/HybridAOR.
• More info at www.exascale.info.
• In the coming days you’ll find our paper, this presentation, and the new crowdsourced relevance judgments at www.exascale.info/AOR.