ECIR – a Lightweight Approach for Entity-centric Information Retrieval

ECIR – a Lightweight Approach for Entity-centric Information Retrieval. TREC 2010. HPI Potsdam, SAP Research Center Dresden. Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann.


Transcript of ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Page 1: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

ECIR – a Lightweight Approach for Entity-centric Information Retrieval

TREC 2010

HPI Potsdam, SAP Research Center Dresden

Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann

Page 2: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Who we are

SAP Research, HPI ECIR seminar

Benjamin Emde, Michael Leben, Christoph Thiele, Alexander Hold, Wojciech Barczynski, Falk Brauer

Prof. Dr. Felix Naumann, Information Systems chair at Hasso-Plattner-Institut

Page 3: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Motivation: Entity-centric Search in Enterprises

■ Companies cannot index the whole web:

□ have to leverage resources on the web (search engines)

□ extract/rank entities at runtime (limited hardware resources)

■ But there is business value in discovering related entities, e.g., to determine competing companies:

□ Source: SAP

□ Target: organization

□ Narrative: Find competitors developing ERP software.

■ Our system evaluates the capabilities of three different search engines to answer queries for the Related Entity Finding (REF) task.

TREC2010-ECIR HPI/SAP | HPI Information Systems | SAP Research | Nov 19, 2010

Page 4: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Processing Pipeline

[Figure: processing pipeline. Stages: Topics, Query Rewriting, Document Retrieval, Rule Generator, Target Entity Extraction, Deduplication & Ranking, Homepage Retrieval, ClueWeb Mapping, Results. Supporting components: Freebase, POS Tagger, Search Engine.]

Page 5: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Processing Pipeline

[Figure: processing pipeline diagram, identical to the one on Page 4.]

Page 6: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Query Rewriting - Keyword query generation

<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>

~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )

Page 7: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Query Rewriting - Synonym Retrieval

<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>

~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )

Alternative names are retrieved from Freebase, and the most popular ones are kept (ranked by search engine hit count).
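The synonym step on this slide (alternative names from Freebase, ranked by search engine hit count) can be sketched roughly as follows. This is only an illustration under assumptions, not the system's code: `rank_alternative_names`, `to_or_group`, the `hit_count` callback and the cut-off `top_k` are hypothetical names, and the hit counts in the usage lines are made-up placeholders rather than real search engine or Freebase data.

```python
from typing import Callable, Iterable, List


def rank_alternative_names(
    names: Iterable[str],
    hit_count: Callable[[str], int],
    top_k: int = 3,
) -> List[str]:
    """Keep the top_k names with the highest (assumed) search engine hit count."""
    unique = list(dict.fromkeys(names))  # drop exact duplicates, keep order
    ranked = sorted(unique, key=hit_count, reverse=True)
    return ranked[:top_k]


def to_or_group(names: Iterable[str]) -> str:
    """Render names as a quoted OR group for a keyword query."""
    return "(" + " OR ".join(f'"{n}"' for n in names) + ")"


# Placeholder hit counts (assumed values, for illustration only).
fake_hits = {
    "Costco": 900,
    "Costco Wholesale": 500,
    "Costco travel": 100,
    "Costco Wholesale Corp.": 50,
}
best = rank_alternative_names(fake_hits, fake_hits.get, top_k=2)
or_group = to_or_group(best)
```

In a real setup the `hit_count` callback would query the selected search engine; how many names the system keeps is not stated on the slide.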

Page 8: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Query Rewriting - Filtering using POS tags

<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>

Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP televisions/NNS sold/VBN by/IN Costco/NNP

~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
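A rough sketch of how the POS-tagged narrative could be filtered into a keyword query. The assumptions are: only noun (NN*) and verb (VB*) tokens are kept, the source entity's own words and an instruction word list (`skip_words`, hypothetical) are dropped, and each kept term gets the `~` prefix seen in the slide's query. The function names are invented for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def parse_tagged(tagged: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def keyword_query(
    tagged: str,
    source_names: Sequence[str],
    skip_words: Iterable[str] = ("find",),
) -> str:
    """Keep noun and verb tokens and append the source names as an OR group."""
    excluded = {w.lower() for name in source_names for w in name.split()}
    excluded.update(w.lower() for w in skip_words)
    terms = []
    for word, tag in parse_tagged(tagged):
        # NN* covers nouns, VB* covers verbs in the Penn Treebank tag set.
        if tag.startswith(("NN", "VB")) and word.lower() not in excluded:
            terms.append("~" + word)
    names = " OR ".join(f'"{n}"' for n in source_names)
    return " ".join(terms) + " (" + names + ")"


tagged_narrative = (
    "Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP "
    "televisions/NNS sold/VBN by/IN Costco/NNP"
)
query = keyword_query(tagged_narrative, ["Costco", "Costco Wholesale"])
```

The sketch keeps the narrative's word order and surface forms, so it yields `~televisions` and puts `~sold` last; the slide's query shows `~television` and a different term order, which this sketch does not attempt to reproduce.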

Page 9: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Processing Pipeline

[Figure: processing pipeline diagram, identical to the one on Page 4.]

Page 10: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Extraction Rule Construction for SAP Business Objects ThingFinder

■ Source entity and its alternative names:

#group SourceName (scope="Sentence"):(<Costco>|<Costco><travel>|<Costco><Wholesale><Corp.>|..)

■ Mapping a topic’s target type to predefined ThingFinder types:

#group TargetType (scope="Sentence"): [TE ORGANIZATION]<>+[/TE]

■ Stemmed noun and verb tokens from a topic’s narrative:

#group Context (scope="Sentence"): (<STEM:sold>|<STEM:LCD>|<STEM:manufacturer>|..)

■ Combined rule to extract candidates:

#group Candidate (scope="Paragraph"): [UL]%(TargetType),%(SourceName),(%(Context))[/UL]
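The rule generator step can be pictured as simple string templating over the topic. The sketch below only fills the rule text exactly as it appears on this slide; the syntax is copied from the slide and has not been validated against ThingFinder, and `name_pattern` and `build_rules` are hypothetical helper names.

```python
from typing import List, Sequence


def name_pattern(name: str) -> str:
    """Turn 'Costco Wholesale Corp.' into '<Costco><Wholesale><Corp.>'."""
    return "".join(f"<{tok}>" for tok in name.split())


def build_rules(
    source_names: Sequence[str],
    target_type: str,
    context_stems: Sequence[str],
) -> List[str]:
    """Fill the four rule templates shown on the slide (syntax assumed from it)."""
    source = "|".join(name_pattern(n) for n in source_names)
    context = "|".join(f"<STEM:{s}>" for s in context_stems)
    return [
        f'#group SourceName (scope="Sentence"):({source})',
        f'#group TargetType (scope="Sentence"): [TE {target_type}]<>+[/TE]',
        f'#group Context (scope="Sentence"): ({context})',
        '#group Candidate (scope="Paragraph"): '
        "[UL]%(TargetType),%(SourceName),(%(Context))[/UL]",
    ]


rules = build_rules(
    ["Costco", "Costco Wholesale Corp."],
    "ORGANIZATION",
    ["sold", "LCD", "manufacturer"],
)
```

Splitting names on whitespace is an assumption; the actual tokenization and stemming are done by ThingFinder according to the presenter notes.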

Mischka
Perhaps a short overview of BOBJ-TA. Custom setup vs. specialized rules. - alternative names: clear - ThingFinder type: what does "TE" mean? The ThingFinder type is "Organization" - N/V tokens: who does the stemming? POS tagger output? TF does that. Can the rules be compared to RegEx?
Mischka
Prof. Naumann would like the bullet points to be more prominent and the original syntax to stay more in the background.
Page 11: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Example Candidate Rule Firing

EXTRACTED PARAGRAPH:

SENTENCE#1: Update, October 2006: Prices have dropped everywhere in the flat panel TV market, even among top-tier manufacturers like Panasonic, Sharp and Sony.

SENTENCE#2: While Costco offers a 50" Vizio HDTV LCD for around $2,000, top-notch sets from Panasonic are selling for as low as $2,400 at authorized internet dealers—that factors out to a difference of only $40 a year if you consider the 10-year lifespan of the Panasonic Plasma TV.

SENTENCE#3: From what we've seen of the build quality in the "bargain-basement" models by Vizio, Maxent, and Envision, you'll be lucky if your discount 50" televisions lasts half that long.

Legend: Source Entity, Context Token, Potential Target Entity

michael.leben
Mischka
Naumann: shorter. This was explained in great detail, although the individual problems (duplicates, entity ranking) are covered in detail later.
Page 12: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Processing Pipeline

[Figure: processing pipeline diagram, identical to the one on Page 4.]

Page 13: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Entity Deduplication and Ranking per document

■ Deduplication combines Jaro-Winkler and Jaccard Similarity.

■ local ranking (duplicate group per document):

□ distance in sentences between Source and Target Entity

□ normalized so that a distance of 0 yields a score of 1

□ per duplicate group consider only maximum score per document

■ rank 1: Vizio, Panasonic (sentence distance 0)

■ rank 2: Sharp, Sony, Maxent, Envision (sentence distance 1)
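The deduplication bullet names two similarity measures but not how they are combined. The sketch below assumes one possible combination: two names count as duplicates if either the Jaro-Winkler similarity of the strings or the Jaccard similarity of their token sets reaches a threshold. The threshold 0.9, the lowercasing, the greedy grouping and all function names are assumptions for illustration.

```python
from typing import List


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-separated token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Assumed combination: either measure reaching the threshold is enough."""
    a, b = a.lower().strip(), b.lower().strip()
    return max(jaro_winkler(a, b), jaccard(a, b)) >= threshold


def group_duplicates(names: List[str]) -> List[List[str]]:
    """Greedily put each name into the first group with a duplicate member."""
    groups: List[List[str]] = []
    for name in names:
        for group in groups:
            if any(is_duplicate(name, member) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = group_duplicates(["Panasonic", "Sony", "panasonic", "Vizio", "Vizio Inc"])
```

The name variants in the usage line ("panasonic", "Vizio Inc") are made up to exercise the grouping; they do not come from the slides.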

…even among top-tier manufacturers like Panasonic, Sharp and Sony.

While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV.

From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
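The local ranking on this slide can be traced with a small sketch. The slide only states that the score is based on sentence distance and normalized so that distance 0 gives score 1; the concrete function 1 / (1 + distance) used here is an assumption, and `local_scores` is a hypothetical name. Sentence numbers follow the three example sentences, with Costco in sentence 2.

```python
from typing import Dict, List


def local_scores(
    source_sentences: List[int],
    mentions: Dict[str, List[int]],
) -> Dict[str, float]:
    """Score each duplicate group by its closest mention to the source entity.

    Assumed scoring function: 1 / (1 + distance), so distance 0 gives 1.0.
    Taking the smallest distance equals keeping the maximum score per group.
    Both sentence lists are assumed to be non-empty.
    """
    scores = {}
    for group, sentence_ids in mentions.items():
        distance = min(
            abs(s - t) for s in source_sentences for t in sentence_ids
        )
        scores[group] = 1.0 / (1.0 + distance)
    return scores


# Sentence positions taken from the example paragraph (Costco is in sentence 2).
scores = local_scores(
    source_sentences=[2],
    mentions={
        "Panasonic": [1, 2],
        "Sharp": [1],
        "Sony": [1],
        "Vizio": [2, 3],
        "Maxent": [3],
        "Envision": [3],
    },
)
```

With these inputs Vizio and Panasonic score 1.0 (distance 0) and Sharp, Sony, Maxent and Envision score 0.5 (distance 1), which matches the two ranks listed on the slide.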

Page 14: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Entity Deduplication and Ranking across documents

■ global ranking (for duplicate group in document corpus)

□ aggregated local scores for duplicate groups

□ considers rank position of documents

□ Target Entities extracted from higher ranked documents are preferred
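One way to read the global ranking bullets is as a rank-weighted sum of local scores. The sketch below assumes a weight of 1 / rank for a document at search engine rank `rank`; the slides do not give the actual aggregation formula, `global_ranking` is a hypothetical name, and the local scores in the usage lines are made-up values.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def global_ranking(
    docs: List[Tuple[int, Dict[str, float]]],
) -> List[Tuple[str, float]]:
    """Aggregate local scores per duplicate group across documents.

    docs holds (search engine rank starting at 1, local scores per group).
    Assumed weighting: 1 / rank, so higher ranked documents count more.
    """
    totals: Dict[str, float] = defaultdict(float)
    for rank, scores in docs:
        weight = 1.0 / rank
        for group, score in scores.items():
            totals[group] += weight * score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up local scores for a document at rank 1 and a document at rank 5.
ranking = global_ranking([
    (1, {"Vizio": 1.0, "Sony": 1.0, "Samsung": 1.0}),
    (5, {"Vizio": 1.0, "Panasonic": 1.0, "Sharp": 0.5}),
])
```

Under this weighting, an entity found only in the rank 5 document can contribute at most 0.2, so entities from the rank 1 document are preferred, in line with the last bullet.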

… stores Costco Wholesale and Sam's Clubs.

In addition, the quality of Vizio LCDs looks very similar to the Sonys and Samsungs on …

document b; search engine rank 1

document a; search engine rank 5

…even among top-tier manufacturers like Panasonic, Sharp and Sony.

While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV.

From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …

Page 15: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Processing Pipeline

[Figure: processing pipeline diagram, identical to the one on Page 4.]

Page 16: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Homepage Candidate Retrieval

Topic:
<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>

expanded query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )

Target Entities: Panasonic, Sharp, Sony, Vizio, Maxent, Envision

queries to the selected search engine

Page 17: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Homepage Candidate Retrieval

queries to the selected search engine:
http://www.google.com/search?q=vizio
http://www.google.com/search?q=feature:homepage+vizio
http://www.google.com/search?q=allintitle:vizio
http://www.google.com/search?q=allinanchor:Find+homepages+of+manufacturers...

homepage candidates set:
search engine results
Wikipedia outgoing links
shortened URLs
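The four query forms listed on this slide can be generated per target entity roughly as below. This is a sketch under assumptions: `homepage_queries` and `SEARCH_URL` are hypothetical names, the base URL is the one shown on the slide, and the anchor text in the usage lines is only the visible part of the slide's elided `allinanchor` query.

```python
from typing import List
from urllib.parse import quote_plus

# Base URL as it appears on the slide.
SEARCH_URL = "http://www.google.com/search?q="


def homepage_queries(entity: str, anchor_text: str) -> List[str]:
    """Build the four query URL forms shown on the slide for one entity."""
    queries = [
        entity,
        f"feature:homepage {entity}",
        f"allintitle:{entity}",
        f"allinanchor:{anchor_text}",
    ]
    # Keep ':' unescaped so the operators stay readable, as on the slide.
    return [SEARCH_URL + quote_plus(q, safe=":") for q in queries]


urls = homepage_queries("vizio", "Find homepages of manufacturers")
```

The result pages of these queries, together with Wikipedia outgoing links and shortened URLs, would then form the homepage candidate set.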

Page 18: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Homepage Ranking | source-specific

aggregate scores of candidates from multiple sources

extract vector of 17 features for each candidate:

5 source-specific flags (f1 to f5)

12 text-based page features (f6 to f17)

homepage candidates set:

search engine results
http://www.vizio.com/warranties-installation/
http://www.vizio.com/discover/via/

Wikipedia outgoing links
http://www.youreviewelectronics.com/vizio-reviews/
http://www.vizio.com/

shortened URLs
http://www.vizio.com/
http://www.vizio.com/discover/
http://wiki.answers.com/Q
http://www.youreviewelectronics.com/vizio-reviews

flags f1 to f5:

allintitle operator

related to Wikipedia page
http://www.vizio.com/products
http://wiki.answers.com/Q/Who_manufactures_Vizio_TV

allinanchor operator

feature:homepage operator
http://www.vizio.com/products

Mischka
feature:homepage only for Yahoo: sites with "~"
Page 19: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Homepage Ranking | configurable features

feature values are multiplied with configurable weights

finding best weight configuration:

depends on search engine

genetic algorithm for training
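The weighted scoring and its training can be sketched as follows. The slides only say that feature values are multiplied with configurable weights and that a genetic algorithm finds the best weight configuration per search engine; the summation of the weighted features, the GA details (elitism, one-point crossover, random-reset mutation), all parameter values, the function names and the toy training data are assumptions for illustration. The toy data uses 3 features instead of the 17 mentioned on the previous slide.

```python
import random
from typing import Callable, List, Sequence


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of a candidate's feature values."""
    return sum(f * w for f, w in zip(features, weights))


def train_weights(
    fitness: Callable[[List[float]], float],
    n_weights: int = 17,
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Small genetic algorithm: keep the better half, refill with children."""
    rng = random.Random(seed)
    population = [
        [rng.random() for _ in range(n_weights)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)
            cut = rng.randrange(1, n_weights)  # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [
                rng.random() if rng.random() < mutation_rate else w
                for w in child
            ]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


# Toy training pairs (made up): (correct homepage, wrong candidate) features.
pairs = [
    ([1.0, 0.0, 1.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0, 0.0], [0.0, 1.0, 0.0]),
]


def fitness(weights: List[float]) -> int:
    """Number of pairs in which the correct homepage gets the higher score."""
    return sum(score(good, weights) > score(bad, weights) for good, bad in pairs)


best_weights = train_weights(fitness, n_weights=3)
```

In the real system the fitness would presumably be a retrieval metric measured on training topics for the chosen search engine; that is not spelled out on the slide.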

Page 20: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Evaluation | TREC2010 nDCG values

B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction

TREC2010-ECIR HPI/SAP | HPI information systems | SAP research | Nov 09, 2010


Page 21: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Evaluation | TREC2010 nDCG values

B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction

nDCG aggregated over all topics
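For readers unfamiliar with the metric, here is a minimal sketch of one common nDCG formulation (gain divided by log2 of the rank position plus one). It is not necessarily the exact variant, gain values or cut-off used in the TREC evaluation, and the aggregation over topics is assumed here to be the arithmetic mean. As a simplification, the ideal ordering is computed from the returned list only.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains: Sequence[float]) -> float:
    """DCG divided by the DCG of the same gains in ideal (sorted) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_topic_gains: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: arithmetic mean of per-topic nDCG values."""
    return sum(ndcg(g) for g in per_topic_gains) / len(per_topic_gains)


example = mean_ndcg([[3, 2, 1], [0, 0]])
```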


Page 22: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Evaluation 2009 | Target Entity Types

score averaged over all runs for the TREC 2009 topics

performance for Organizations and Persons is better than for Products

Page 23: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Results from TREC 2010

avg nDCG around 0.8

Page 24: ECIR – a Lightweight Approach for Entity-centric Information Retrieval

Results from TREC 2010 | bugfixed

avg nDCG around 1.15