Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval

Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux

eXascale Infolab - University of Fribourg - Switzerland

{firstname.lastname}@unifr.ch

SIGIR2012 - Monday, August 13th 2012


Motivation

• Many search engine queries are about entities.

• Increasingly large amount of entity data online.

• Often represented as huge graphs

• e.g. the LOD cloud, Google Knowledge Graph, Facebook social graph.

• Globally unique entity identifiers (e.g., URIs).

• Hard to discover and/or memorize.

Ad-hoc Object Retrieval (informal definition)

• “Given the description of an entity, give me back its identifier”

• Description can be keywords (e.g., “Harry Potter”).

• More than one identifier per entity (e.g., dbpedia + freebase).

• How to evaluate returned results?

Ad-hoc Object Retrieval (formal definition by Pound et al.)

• Input: unstructured query q and data graph G.

• Output: ranked list of resource identifiers (URIs) from G.

• Evaluation: results (URIs) scored by a judge with access to all the information contained in or linked to the resource.

• Standard collections exist.

[Figure: two example lists of result URIs]

First list:

http://ex.plode.us/tag/harry+potter

http://www.vox.com/explore/interests/harry%20potter

http://www.flickr.com/groups/harrypotterandthedeathlyhallows/

http://harrypotter.wizards.pro/

Second list:

http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows

http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk

http://harrypotter.wizards.pro/

http://ebiquity.umbc.edu/person/html/Harry/Chen/

http://dbpedia.org/resource/Ceramist

Overview of Our Solution

• Inverted indices on the LOD Cloud...

• ...and RDF store containing the data.

• Simple NLP techniques, autocompletion, pseudo-relevance feedback, BM25, BM25F.

A Simple Example

Query: SIGIR

• Query processing: NLP techniques, query auto-completion, pseudo-relevance feedback.

• II + ranking function(s). How to build the II?

1. http://dbpedia.org/…/SIGIR

2. http://dbpedia.org/…/IRAQ

3. …

• Graph traversals. Which properties should we follow? How to rank new results?

• Final ranking function:

1. http://dbpedia.org/…/SIGIR

2. http://freebase.com/…/sigir

3. http://dbpedia.org/…/IRAQ

…

Outline

1. Inverted Indices

2. Graph-Based Entity Search

2.1. Object Properties vs Datatype Properties

2.2. Properties to Follow

3. Experimental Results

3.1. Experimental Setting

3.2. IR Techniques: Experimental Results

3.3. Evaluation of the Hybrid Approaches

3.4. Overhead of the Graph Traversal

1. Inverted Indices (IIs)

• Simple inverted index:

• index all literals attached to each node in the input graph.

• “movie” → http://…types/film

• Structured inverted index with three fields:

• URI - tokenized URIs identifying entities.

• Label - manually selected datatype properties pointing to textual descriptions of the entity (e.g., label, title, name, full-name, …).

• Attributes - all other literals.

BM25(F), query auto-completion, query extension, relevance feedback.
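The three-field layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's implementation: the tokenizer, the `LABEL_PROPERTIES` set, the helper names, and the `example.org` URIs are all hypothetical, and the index stores plain posting sets rather than the statistics a BM25(F) ranker would need.

```python
import re
from collections import defaultdict

# Hypothetical label-like datatype properties. The slides only say that such
# properties were selected manually (label, title, name, full-name, ...).
LABEL_PROPERTIES = {"label", "title", "name", "full-name"}


def tokenize(text):
    """Lowercase and split on every non-alphanumeric character."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def local_name(uri):
    """Last path or fragment segment of a URI ('.../prop/label' -> 'label')."""
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def build_structured_index(literal_triples):
    """Build {field: {token: set of entity URIs}} from (subject, predicate, literal) triples."""
    index = {field: defaultdict(set) for field in ("uri", "label", "attributes")}
    for subject, predicate, literal in literal_triples:
        # URI field: tokens of the entity identifier itself.
        for token in tokenize(subject):
            index["uri"][token].add(subject)
        # Label field for label-like properties, attributes field otherwise.
        field = "label" if local_name(predicate) in LABEL_PROPERTIES else "attributes"
        for token in tokenize(literal):
            index[field][token].add(subject)
    return index


def lookup(index, query, fields=("uri", "label", "attributes")):
    """Entities that match every query token in at least one of the given fields."""
    result = None
    for token in tokenize(query):
        hits = set()
        for field in fields:
            hits |= index[field].get(token, set())
        result = hits if result is None else result & hits
    return result or set()


# Made-up data for illustration only.
triples = [
    ("http://example.org/resource/Harry_Potter", "http://example.org/prop/label", "Harry Potter"),
    ("http://example.org/resource/Harry_Potter", "http://example.org/prop/abstract", "A series of fantasy novels"),
    ("http://example.org/types/film", "http://example.org/prop/comment", "A movie or motion picture"),
]
index = build_structured_index(triples)
print(lookup(index, "movie"))
print(lookup(index, "harry potter", fields=("label",)))
```

Keeping the fields separate is what lets a fielded ranker such as BM25F weight a match in the label differently from a match in the remaining attributes.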

2. Graph-Based Entity Search

[Diagram: top-N IR results → follow properties p1, p2, …, p_m → new URIs → filter with sim(e, q) > τ? → assign scores (e.g., 0.284, 1.428, 0.556) → merged re-ranked results]

• Take top-N docs.

• Follow links/properties and get new URIs.

• Filter new results by text similarity w.r.t. the user query.

• Scoring functions: count sim > τ, avg sim > τ, sum sim, avg sim, sum BM25 - ε.
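The four steps above can be sketched as follows. This is a rough illustration under several assumptions: `jaccard` is a stand-in for the unspecified similarity function sim, the scoring shown is one possible reading of the "sum sim" variant, IR scores and similarity scores are merged without any normalization, and the identifiers `db:SIGIR`, `fb:sigir`, `db:IRAQ`, and `db:Baghdad` are invented placeholders.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (assumed stand-in for sim)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def expand_and_rerank(query, ir_results, edges, labels, properties, n=3, tau=0.0):
    """
    query:      user query string
    ir_results: list of (uri, score) pairs from the inverted index, best first
    edges:      dict mapping (uri, property) to a list of neighbour URIs
    labels:     dict mapping a URI to its textual description
    properties: properties that the traversal is allowed to follow
    """
    scores = dict(ir_results)
    new_sims = {}
    for uri, _ in ir_results[:n]:                      # 1. take the top-N docs
        for prop in properties:                        # 2. follow properties
            for neighbour in edges.get((uri, prop), []):
                if neighbour in scores:
                    continue                           # already an IR result
                sim = jaccard(labels.get(neighbour, ""), query)
                if sim > tau:                          # 3. filter by similarity
                    new_sims.setdefault(neighbour, []).append(sim)
    for uri, sims in new_sims.items():                 # 4. assign scores ("sum sim")
        scores[uri] = sum(sims)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up example data.
ir_results = [("db:SIGIR", 2.0), ("db:IRAQ", 0.5)]
edges = {
    ("db:SIGIR", "sameAs"): ["fb:sigir"],
    ("db:IRAQ", "sameAs"): ["db:Baghdad"],
}
labels = {"fb:sigir": "SIGIR", "db:Baghdad": "Baghdad"}
merged = expand_and_rerank("sigir", ir_results, edges, labels, ["sameAs"])
print(merged)
```

With τ = 0 and a strict comparison, a neighbour that shares no token with the query is dropped, while any partial match is kept.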

2.1. Object Properties vs Datatype Properties

• Object properties:

• connect different entities

• explore all the graph

• Datatype properties:

• give additional info about entities

• explore just the neighborhood of a node

2.2. Properties to Follow

• RDF graph queried with SPARQL queries.

• Scope 1 queries vs Scope 2 queries.

• Set of predicates to follow selected using:

• Common sense (e.g., sameAs)

• Statistics from the data
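As a rough sketch of what such queries might look like, the helpers below build SPARQL strings for a given entry point. The slides do not define the two scopes, so reading Scope 1 as one hop and Scope 2 as two hops is an assumption, as are the function names, the use of the SPARQL 1.1 `VALUES` clause, the restriction to outgoing edges, and the `example.org` entity.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def _values(predicates):
    """Render predicate IRIs as the body of a SPARQL VALUES clause."""
    return " ".join(f"<{p}>" for p in predicates)


def scope1_query(entity_uri, predicates):
    """Entities one hop away from entity_uri along the selected predicates."""
    return (
        "SELECT DISTINCT ?o WHERE { "
        f"VALUES ?p {{ {_values(predicates)} }} "
        f"<{entity_uri}> ?p ?o . "
        "FILTER(isIRI(?o)) }"
    )


def scope2_query(entity_uri, predicates):
    """Entities two hops away from entity_uri along the selected predicates."""
    values = _values(predicates)
    return (
        "SELECT DISTINCT ?o2 WHERE { "
        f"VALUES ?p1 {{ {values} }} VALUES ?p2 {{ {values} }} "
        f"<{entity_uri}> ?p1 ?o1 . ?o1 ?p2 ?o2 . "
        "FILTER(isIRI(?o2)) }"
    )


# Hypothetical entry point, as it would be returned by the inverted index.
entry_point = "http://example.org/resource/SIGIR"
q1 = scope1_query(entry_point, [OWL_SAME_AS])
q2 = scope2_query(entry_point, [OWL_SAME_AS])
print(q1)
print(q2)
```

Restricting `?p` to a short predicate list keeps the traversal cheap; following every object property would explore far more of the graph.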

Properties to Follow: Two Examples

[Figure: two example traversals; the entry point is given by the II]

3. Experimental Results

3.1. Experimental Setting

• SemSearch 2010 and 2011 test sets:

• Billion Triple Challenge 2009 (BTC2009)

• 1.3 billion RDF triples crawled from the LOD cloud.

• 92 and 50 queries, respectively.

• Evaluation of systems with depth-10 pooling by means of crowdsourcing.

• Measures considered: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and early precision (P10).
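For reference, the three measures can be computed as in the sketch below. It uses the common textbook definitions; the exact gain and discount variant of NDCG used for the SemSearch collections is not stated on the slides, so the log2 discount and the linear gains here are assumptions, and the toy run is made up.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant (P10 when k = 10)."""
    return sum(1 for uri in ranked[:k] if uri in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, uri in enumerate(ranked, start=1):
        if uri in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


def ndcg_at_k(ranked, gains, k=10):
    """NDCG with a log2 rank discount; gains maps a URI to its graded relevance."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(uri, 0) for uri in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Toy run: results "a" and "c" are relevant and appear at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(average_precision(ranked, relevant))
print(precision_at_k(ranked, relevant, k=2))
print(ndcg_at_k(ranked, {"a": 1, "c": 1}))
```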

Completing Relevance Judgements by Crowdsourcing

• We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.

• To be fair, we used the same design and settings that were used for the AOR task of SemSearch.

3.2. IR Techniques: Experimental Results

[Slide annotation: our baseline.]

3.3. Evaluation of Hybrid Approaches

N = 3, τ = 0, score = sum BM25 - ε

3.4. Overhead of the Graph Traversal

• Time in milliseconds needed for each part of the hybrid approaches.

• Measures taken on a single machine with cold cache.

Surprisingly small overhead (17% for best results).


Conclusions

• AOR = “Given the description of an entity, give me back its identifier”

• Disappointing results using simple IR techniques for the AOR task.

• Hybrid system for AOR:

• combining classic IR techniques + structured database storing graph data.

• Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over BM25 baseline).

• For the best working configuration found, the overhead caused by the graph traversal part is limited (17% more than running the chosen baseline).


Thank you for your attention

• You can find the new relevance judgments at http://diuf.unifr.ch/xi/HybridAOR.

• More info at www.exascale.info.

• In the following days you’ll find our paper, this presentation, and the new crowdsourced relevance judgements at www.exascale.info/AOR.
