On Burstiness-Aware Search for Document Sequences
Theodoros Lappas Benjamin Arai
Manolis Platakis Dimitrios Kotsakos
Dimitrios Gunopulos
SIGKDD 2009
SIGKDD 2009 Theodoros Lappas
Outline
The Problem: How to effectively search through large document sequences (e.g. newspapers)
Previous Work
Using Bursty Terms to identify Events
Modeling Burstiness using Discrepancy Theory
Our Search Framework
Experiments
The Problem
Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
Consider the San Francisco Call, a daily newspaper from the 1900s.
We are given the query <theater, disaster>
Two candidate events, relevant to the query:
The disastrous fire of 1903 in the Iroquois Theater in Chicago
A disastrous performance given by an actor in a local theater
Clearly the first event is far more influential: articles on this event should be ranked higher!
Previous Work
Burstiness explored in different domains
Burst Detection - Kleinberg 2002
Stream clustering - He et al. 2007
Graph Evolution - Kumar et al. 2003
Event Detection - Fung et al. 2005
Nothing on Burstiness-aware Search:
Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
Event Detection Techniques do not consider user input.
Burstiness
Bursty periods: periods of “unusually” high frequency
Unusual? Deviating from an expected baseline.
Major Events are discussed in numerous articles for an extended timeframe.
The event’s keywords exhibit high frequency bursts during the timeframe
Figure: Frequency of the term “earthquake” in the SF Call (1908-1909).
Modeling Burstiness using Discrepancy Theory
Discrepancy: Used to express and quantify the deviation from the norm
In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency
Maximal Interval: An interval that neither contains nor is contained in an interval of higher score.
MAX-1: Linear-Time Algorithm for Maximal Interval Extraction.
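As a rough illustration of the idea above, the sketch below scores each timestamp by the difference between its share of the observed frequency and its share of the expected frequency, then repeatedly takes the highest-scoring interval and recurses on the stretches left over on either side. It is a simple quadratic-time stand-in, not the MAX-1 algorithm itself. The scoring formula and the names discrepancy_scores, best_interval and maximal_intervals are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end inclusive, score)


def discrepancy_scores(observed: Sequence[float],
                       expected: Sequence[float]) -> List[float]:
    """Per-timestamp score: observed share minus expected share (assumed form)."""
    total_obs, total_exp = sum(observed), sum(expected)
    if total_obs == 0 or total_exp == 0:
        return [0.0] * len(observed)
    return [o / total_obs - e / total_exp for o, e in zip(observed, expected)]


def best_interval(scores: Sequence[float], lo: int, hi: int) -> Optional[Interval]:
    """Kadane-style scan of scores[lo:hi]; None if no interval has a positive sum."""
    best: Optional[Interval] = None
    run_sum, run_start = 0.0, lo
    for i in range(lo, hi):
        if run_sum <= 0:              # a non-positive prefix never helps
            run_sum, run_start = 0.0, i
        run_sum += scores[i]
        if run_sum > 0 and (best is None or run_sum > best[2]):
            best = (run_start, i, run_sum)
    return best


def maximal_intervals(scores: Sequence[float]) -> List[Interval]:
    """Non-overlapping positive-score intervals, highest score first."""
    found: List[Interval] = []
    pending = [(0, len(scores))]      # half-open ranges still to be searched
    while pending:
        lo, hi = pending.pop()
        top = best_interval(scores, lo, hi)
        if top is None:
            continue
        found.append(top)
        pending.append((lo, top[0]))          # stretch to the left
        pending.append((top[1] + 1, hi))      # stretch to the right
    return sorted(found, key=lambda iv: iv[2], reverse=True)
```

With a flat expected baseline, a day scores positively only when it holds more than its even share of the term's occurrences, so quiet periods act as separators between bursts.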
Baseline - Discussion
Baseline can be dynamic :
– frequency sequence(s) from previous year(s)
– Time Series Decomposition to extract Seasonal, Trend and Irregular Components
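The two baseline options can be sketched roughly as follows. The first function averages the frequency sequences of earlier years position by position. The second is a centered moving average, used here only as a crude stand-in for the trend component of a full time series decomposition. Both function names and the window size are assumptions for illustration.

```python
from typing import List, Sequence


def baseline_from_history(previous_years: Sequence[Sequence[float]]) -> List[float]:
    """Average the frequency at each position (e.g. day of year) across earlier years."""
    if not previous_years:
        raise ValueError("need at least one earlier year")
    length = min(len(year) for year in previous_years)   # align to the shortest year
    return [sum(year[i] for year in previous_years) / len(previous_years)
            for i in range(length)]


def moving_average(sequence: Sequence[float], window: int = 7) -> List[float]:
    """Centered moving average; the window is truncated at both ends of the sequence."""
    half = window // 2
    smoothed = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        smoothed.append(sum(sequence[lo:hi]) / (hi - lo))
    return smoothed
```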
Phase 1 : Preprocessing
Input: A raw document sequence.
Output: The set of terms to be monitored.
Preprocessing Methods:
Stemming, Synonym matching, etc.
Stopwords Removal
Frequency Pruning for rare words
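A minimal sketch of these preprocessing steps is shown below, assuming English text. The stopword list is a tiny hypothetical sample, crude_stem is a toy suffix stripper rather than a real stemmer, synonym matching is left out, and the min_doc_freq cutoff is an assumed parameter.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "by", "at", "on"}


def crude_stem(token: str) -> str:
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str) -> List[str]:
    """Lowercase, split on non-letters, drop stopwords, then stem."""
    return [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def monitored_terms(documents: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Keep terms that occur in at least min_doc_freq documents."""
    doc_freq: Counter = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))   # count each term once per document
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}
```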
Phase 2 – Retrieval of Bursty Intervals
Input: A term
Output: Set of non-overlapping intervals + their burstiness scores
1) Create the frequency sequence for the term.
2) Extract bursty intervals using the MAX-1 algorithm
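Step 1 can be sketched as a per-day count of the documents that contain the term. The input layout (pairs of publication date and token collection) and the name frequency_sequence are assumptions for illustration.

```python
from datetime import date
from typing import Collection, Iterable, List, Tuple


def frequency_sequence(term: str,
                       dated_docs: Iterable[Tuple[date, Collection[str]]],
                       start: date,
                       end: date) -> List[int]:
    """For each day in [start, end], count the documents that contain the term."""
    counts = [0] * ((end - start).days + 1)
    for day, tokens in dated_docs:
        if start <= day <= end and term in tokens:
            counts[(day - start).days] += 1
    return counts
```

Days on which the term does not appear keep a count of zero, so the sequence has one entry per day and can be compared position by position with a baseline of the same length.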
Phase 3 – Interval Indexing
Input: Set of bursty intervals for each term
Output: An Index of Intervals
Simple, easily updatable structure
Need to support multi-term queries
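One simple structure that meets both requirements is a map from each term to its bursty intervals, kept sorted by score. The class below is a hypothetical sketch, not the index described in the talk; the overlapping lookup is included because multi-term queries need to find where the bursts of different terms coincide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int, float]  # (start, end inclusive, burstiness score)


class IntervalIndex:
    """Term -> bursty intervals, highest score first."""

    def __init__(self) -> None:
        self._by_term: Dict[str, List[Interval]] = defaultdict(list)

    def add(self, term: str, start: int, end: int, score: float) -> None:
        """Insert one interval and keep the term's list sorted by score."""
        intervals = self._by_term[term]
        intervals.append((start, end, score))
        intervals.sort(key=lambda iv: iv[2], reverse=True)

    def intervals(self, term: str) -> List[Interval]:
        """All intervals of a term in descending score order."""
        return list(self._by_term.get(term, []))

    def overlapping(self, term: str, start: int, end: int) -> List[Interval]:
        """Intervals of a term that share at least one timestamp with [start, end]."""
        return [iv for iv in self._by_term.get(term, [])
                if iv[0] <= end and start <= iv[1]]
```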
Phase 4 : Top-k Evaluation for Multi-Term Queries
Customized Version of the Threshold Algorithm (TA) for top-k Evaluation.
Standard Version:
– Terms-to-Documents
– Each document either appears in a term’s list or not
Our Version (TA*):
– Terms-to-Intervals
– A bursty interval of a term t1 may overlap multiple intervals of a term t2.
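For reference, the standard terms-to-documents version can be sketched as below. It assumes non-negative scores, sum as the aggregate, and a score of 0 for a document missing from a term's list. The interval-based TA* is not shown, since it must also handle one interval overlapping several intervals of another term. The function name and input layout are assumptions.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Posting = Tuple[Hashable, float]  # (document id, score)


def threshold_algorithm(lists: Dict[str, List[Posting]], k: int) -> List[Posting]:
    """Top-k documents by summed score; each list is sorted by score, descending."""
    lookup = {term: dict(postings) for term, postings in lists.items()}  # random access
    totals: Dict[Hashable, float] = {}
    longest = max((len(p) for p in lists.values()), default=0)
    for depth in range(longest):
        threshold = 0.0
        for postings in lists.values():
            if depth >= len(postings):
                continue              # exhausted list: unseen documents score 0 here
            doc_id, score = postings[depth]   # sorted access
            threshold += score
            if doc_id not in totals:
                totals[doc_id] = sum(lookup[t].get(doc_id, 0.0) for t in lists)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # No unseen document can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
```

The early stop is what makes the approach attractive for long lists: low-scoring entries near the tail are never read once the k best totals reach the threshold.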
Up Next: Experiments
Empirical Evaluation
San Francisco Call: a daily newspaper, issues published 1900-1909 (~400,000 articles)
List of Major Events from 1900-1909 (from Wikipedia) + query for each event.
Experiment 1 - Query Expansion
1) Submit respective query for each event in Major Events List.
2) Get top interval
3) Report the 10 terms that appear in the most document titles within the interval
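Step 3 amounts to a document-frequency count over titles restricted to the top interval. A minimal sketch follows, assuming titles are already tokenized and timestamps are comparable values such as day numbers; the function name expansion_terms is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def expansion_terms(titled_docs: Iterable[Tuple[int, Sequence[str]]],
                    start: int,
                    end: int,
                    n: int = 10) -> List[str]:
    """Terms appearing in the most document titles within [start, end]."""
    counts: Counter = Counter()
    for timestamp, title_tokens in titled_docs:
        if start <= timestamp <= end:
            counts.update(set(title_tokens))   # count each term once per title
    return [term for term, _ in counts.most_common(n)]
```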
Example 1
Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
Query: “king assassination”
Top 10 terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police
Example 2
Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
Query: “English channel”
Top 10 terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine
Experiment 2 – Burst Detection
1) Submit respective query for each event in Major Events List.
2) Get top reported interval
3) Compare with actual event date
We use MAX-1, MAX-2 to extract bursty intervals.
MAX-2 :
– Re-run MAX-1 on each interval
– Obtain nested structure
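One way to make the comparison in step 3 concrete is to measure how far the actual event date lies from the reported interval. The measure below (0 when the date falls inside the interval, otherwise the gap in days to the nearer endpoint) is an assumption for illustration; the slides do not state which measure was used.

```python
from datetime import date


def offset_days(event: date, start: date, end: date) -> int:
    """0 if the event falls inside [start, end], else days to the nearer endpoint."""
    if start <= event <= end:
        return 0
    if event < start:
        return (start - event).days
    return (event - end).days
```

Under this measure a wide interval that contains the event scores 0 regardless of its length, so it is best read together with the interval width when comparing MAX-1 and MAX-2.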
Examples
Event: A fire at the Iroquois Theater in Chicago kills 600.
Query: <theater, disaster>
ACTUAL: Dec 30 1903
MAX-1: 22 Dec - 20 Aug
MAX-2: 31 Dec - 26 Jan
Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.
Query: <steamboat, disaster>
ACTUAL: Jun 15 1904
MAX-1: 14 May - 4 Sep
MAX-2: 16 Jun - 20 Jun
Conclusion
The first efficient end-to-end framework for burstiness-aware search in document sequences.
Future Work:
– Evaluate on even larger Corpora
– Evaluate on more types of text