sigir_faceted2006.ppt

Automatic Discovery of Useful Facet Terms Wisam Dakka – Columbia University Rishabh Dayal – Columbia University Panagiotis G. Ipeirotis – NYU

Transcript of sigir_faceted2006.ppt

Page 1: sigir_faceted2006.ppt

Automatic Discovery of Useful Facet Terms

Wisam Dakka – Columbia University

Rishabh Dayal – Columbia University

Panagiotis G. Ipeirotis – NYU

Page 2: sigir_faceted2006.ppt

Searching the NYT Archive for Book Research

Page 3: sigir_faceted2006.ppt

Motivation: News Archive

Accessing and searching a news archive is not an easy task

Researchers and reporters spend a large amount of time going through their long query results

News archives are huge and cover tens of years

Many relevant results

Results in the first page are not more relevant than the results in the 5th or the 10th page (NYT archive)

Search engines for news archives mainly follow the paradigm: search, skim through long results, modify the query, and search again

Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster

Newsblaster archive

About 6 years of news from 24 news sources

Stories are clustered daily into hierarchies of topics and events

Events are threaded over time, summarized, and classified

Page 4: sigir_faceted2006.ppt

Motivation: MI for Newsblaster Archive

Our multifaceted interfaces work has some limitations [CIKM2005]:

Supervised learning: only facets that appear in the training set can be identified by our algorithm

WordNet hypernyms: WordNet has rather poor coverage of named entities

Free text collections: the quality of the hierarchies built on top of news stories was low

Page 5: sigir_faceted2006.ppt

Challenge: Automatic Extraction of the Useful Facets from News Archive

Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text

Automatically group together facet terms that belong to the same facet

Build the appropriate browsing structure for each facet

Page 6: sigir_faceted2006.ppt

Intuition: Look for Facet Terms Elsewhere

Pilot study: 100 stories from The NYTimes

Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event

Sub-facets: Leaders under People, Corporations under Markets

Clear phenomenon: the terms for the useful facets do not usually appear in the news stories

A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story

We will look for these terms elsewhere: terms that are infrequent in the original collection but frequent in the expanded documents

Page 7: sigir_faceted2006.ppt

Context-Aware Expansion

Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year

[Figure: the passage above is expanded step by step. Important terms are identified using named entities and the Yahoo Term Extractor, and text retrieved for them from Wikipedia, WordNet, and Google is appended to the original passage.]
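The expansion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_RESOURCE`, `extract_important_terms`, and `expand_document` are hypothetical names, the lookup table stands in for a real query to Wikipedia, WordNet, or Google, and its contents are invented for the example.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for an external resource such as Wikipedia or WordNet.
# A real system would query the resource; the entries here are invented.
TOY_RESOURCE: Dict[str, List[str]] = {
    "BP": ["oil company", "energy"],
    "Prudhoe Bay": ["Alaska", "oil field"],
    "Murkowski": ["governor", "Alaska", "politician"],
}


def extract_important_terms(text: str, known_terms: Iterable[str]) -> List[str]:
    """Toy term extractor: keep the known terms that occur in the text."""
    return [term for term in known_terms if term in text]


def expand_document(
    text: str,
    important_terms: Iterable[str],
    resource: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Attach the context terms retrieved for each important term."""
    context: List[str] = []
    for term in important_terms:
        for ctx in resource(term):
            if ctx not in context:  # keep first occurrence, preserve order
                context.append(ctx)
    return {"original": text, "context_terms": context}


story = (
    "Murkowski made the announcement three days after BP said it would "
    "shut down a Prudhoe Bay oil field after a small leak was found."
)
terms = extract_important_terms(story, TOY_RESOURCE)
expanded = expand_document(story, terms, lambda t: TOY_RESOURCE.get(t, []))
```

The original text is kept separate from the retrieved context terms so that term frequencies can later be computed for both the original and the expanded collection.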

Page 8: sigir_faceted2006.ppt

Useful Facet Terms are Elsewhere

[Figure: plot of term frequency in the original documents against term frequency in the context-aware documents, both axes from 0 to 1000. Labels: Infrequent Terms, term t_i, Original Collection, Context-aware Collection.]

Page 9: sigir_faceted2006.ppt

Term Frequency Analysis

Frequency-based shifting

Due to the Zipfian nature of term frequencies, frequency-based shifting favors terms that already have high frequencies (inverse problem)

Rank-shifting
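The slide names the two measures without defining them, so the following is a minimal sketch under assumptions rather than the authors' formulas. It assumes that frequency-based shifting is the difference in document frequency between the expanded and the original collection, and that rank-shifting is the number of rank positions a term gains after expansion. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def document_frequency(docs: Iterable[Set[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def frequency_shift(df_orig: Counter, df_exp: Counter) -> Dict[str, int]:
    """Assumed definition: expanded frequency minus original frequency."""
    return {t: df_exp[t] - df_orig.get(t, 0) for t in df_exp}


def ranks(df: Counter) -> Dict[str, int]:
    """Rank 1 is the most frequent term; ties are broken alphabetically."""
    ordered = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: i + 1 for i, (term, _) in enumerate(ordered)}


def rank_shift(df_orig: Counter, df_exp: Counter) -> Dict[str, int]:
    """Assumed definition: positions gained in the frequency ranking."""
    r_orig, r_exp = ranks(df_orig), ranks(df_exp)
    worst = len(r_orig) + 1  # rank given to terms absent from the original
    return {t: r_orig.get(t, worst) - r_exp[t] for t in r_exp}


# Toy collections: each document is a set of terms.
original = [{"chirac", "paris"}, {"chirac", "election"}]
expanded = [
    {"chirac", "paris", "france", "political leader"},
    {"chirac", "election", "france", "political leader"},
]
df_o, df_e = document_frequency(original), document_frequency(expanded)
f_shift = frequency_shift(df_o, df_e)
r_shift = rank_shift(df_o, df_e)
```

In this toy data, "france" never appears in the original documents but appears in every expanded one, so it gets a positive shift under both measures, while "chirac" is equally frequent in both collections and does not move.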

Page 10: sigir_faceted2006.ppt

Summary: Candidate Facet Terms

For each document in the database, identify the important terms that are useful to characterize the contents of the document

For each important term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document to create an expanded, “context-aware” document

Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms

Page 11: sigir_faceted2006.ppt

Indicative

Page 12: sigir_faceted2006.ppt

Research in Progress

Cleaning and filtering

Grouping similar facet terms under one facet

Evaluation

The resulting candidate terms

The resulting hierarchies