Information Retrieval with Time Series Query
Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai
University of Illinois at Urbana-Champaign
Malu Castellanos, Meichun Hsu
HP Laboratories
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
Any clues in the companion news stream?
IR for stock market analysis?
What might have caused the stock market crash?
Sept 11 Attack!
What documents to read to analyze such a “causal” topic?
Analysis of Presidential Prediction Markets
What might have caused the sudden drop of price for this candidate?
What “mattered” in this election?
Any clues in the companion news stream?
Tax cut?
What documents to read to analyze such a “causal” topic?
Any clues in the companion product reviews?
Analysis of Product Sales
What might have caused the decrease of sales?
safety concerns
What reviews to read to analyze such a “causal” topic?
Which documents cover such a “trendy” topic?
Finding documents about “trendy” topics
Draw a “time series query”: Find documents about a topic emerging this summer, which has attracted much attention this Oct
Information Retrieval with Time Series Query
• Instead of a keyword query, use a time series as the query
• Retrieve documents that contain topics correlated with the query time series
• Input:
– Time series data with time stamps
– Text stream: a collection of time-stamped documents from the same time period
• Output:
– Ranked list of documents
Ideal Results of Information Retrieval with Time Series Query
[Figure: news stream timeline (2000, 2001, …) alongside a chart of Apple Stock Price; x-axis: Date (7/3/2000 to 12/31/2001), y-axis: Price ($), 0 to 70]
RANK  DATE        EXCERPT
1     9/29/2000   Expect earning will be far below
2     12/8/2000   $4 billion cash in company
3     10/19/2000  Disappointing earning report
4     4/19/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple's new retail store
…     …           …
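IR w/ TS - Method Overview
[Diagram: Input 1 is the text stream of input documents (Sep 2001, Oct 2001, …); Input 2 is the non-text time series. The text stream is turned into a vocabulary with one word frequency curve per word (W1, W2, W3, W4, …). The curves are ranked by correlation with the time series, and the output is a list of ranked documents.]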
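The first step in the overview, turning the time-stamped text stream into word frequency curves, could be sketched roughly as below. This is an illustration only, not the authors' implementation: the function name `word_frequency_curves`, the daily granularity, raw counts, and whitespace tokenization are all assumptions.

```python
from collections import Counter, defaultdict
from datetime import date


def word_frequency_curves(docs, dates):
    """Return {word: [count on dates[0], count on dates[1], ...]}.

    docs  -- iterable of (date, text) pairs (the time-stamped text stream)
    dates -- ordered list of dates shared with the query time series
    """
    index = {d: i for i, d in enumerate(dates)}
    curves = defaultdict(lambda: [0] * len(dates))
    for day, text in docs:
        if day not in index:
            continue  # document falls outside the query period
        # Assumption: whitespace tokenization stands in for a real tokenizer.
        for word, n in Counter(text.lower().split()).items():
            curves[word][index[day]] += n
    return dict(curves)


# Toy text stream (made-up documents, for illustration only).
dates = [date(2001, 9, 10), date(2001, 9, 11), date(2001, 9, 12)]
docs = [
    (date(2001, 9, 10), "markets open higher"),
    (date(2001, 9, 11), "attack closes markets"),
    (date(2001, 9, 12), "attack aftermath attack coverage"),
]
curves = word_frequency_curves(docs, dates)
```

Each curve has one value per time stamp of the query time series, so it can be compared with that series point by point.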
IR w/ TS - Method Overview
1. How to measure correlation between a word and the time series
2. How to aggregate word correlations to rank documents
Correlation Function
• Measure correlation between each word frequency curve and the input time series
1. Pearson Correlation
– Basic correlation
2. Dynamic Time Warping [Senin`08]
– Captures alignment of shifted or stretched time series
[Figure: two time series (values over time), shown before alignment and after time series alignment]
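The two correlation functions can be illustrated with a small pure-Python sketch. It assumes equal-length series for Pearson and uses the textbook dynamic-programming form of DTW with absolute difference as the local cost; how the slides turn a DTW alignment into a correlation score is not stated, so only the raw DTW distance is shown here. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / (sx * sy)


def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # local cost of matching two points
            cost[i][j] = d + min(cost[i - 1][j],      # stretch y
                                 cost[i][j - 1],      # stretch x
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Same peak, shifted by one time step (made-up values).
query = [0, 0, 1, 3, 1, 0, 0]
shifted = [0, 0, 0, 1, 3, 1, 0]
r = pearson(query, shifted)
d = dtw_distance(query, shifted)
```

On this toy pair the Pearson coefficient is well below 1 because the peaks do not line up point by point, while the DTW distance is zero because a warping path can align the peaks exactly. In practice the two series would likely need to be put on a common scale (for example, z-normalized) before DTW, since word counts and stock prices have different units; that step is omitted here.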
Aggregation Function
• Score document correlation by aggregating word correlations
1. Weighted TF-IDF (BM25)
– Use top K correlated words as a text query
– Use IR formula such as BM25
– Use correlation coefficient as a weight
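One way to read the weighted BM25 aggregation is sketched below: the top-K correlated words form the query, and each word's BM25 contribution is multiplied by its correlation coefficient. The exact weighting scheme is not given in the slides, so the multiplication, the BM25 variant, the parameters `k1` and `b`, and the function name `weighted_bm25` are assumptions. The three weights are taken from the American Airlines word table in this deck; the toy corpus is made up.

```python
import math
from collections import Counter


def weighted_bm25(doc_tokens, query_weights, doc_freq, n_docs, avg_len,
                  k1=1.2, b=0.75):
    """BM25 score where each query word's contribution is scaled by its weight."""
    tf = Counter(doc_tokens)
    score = 0.0
    for word, weight in query_weights.items():
        f = tf.get(word, 0)
        if f == 0:
            continue  # word does not occur in this document
        df = doc_freq.get(word, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avg_len))
        score += weight * idf * norm  # assumption: weight multiplies the term score
    return score


# Toy corpus (made-up documents).
corpus = [
    ["taliban", "forces", "afghanistan"],
    ["apple", "retail", "store"],
    ["security", "at", "airports"],
]
doc_freq = Counter(w for doc in corpus for w in set(doc))
avg_len = sum(len(doc) for doc in corpus) / len(corpus)

# Top correlated words and their |rho| values act as the weighted query.
query_weights = {"afghanistan": 0.861351, "security": 0.858745, "taliban": 0.841455}
scores = [weighted_bm25(doc, query_weights, doc_freq, len(corpus), avg_len)
          for doc in corpus]
```

The first document matches two query words and scores highest; the second matches none and scores zero.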
Aggregation Function
2. Average Correlation
a) Average over all terms:
Not all the words are correlated?
b) Average over top-k terms:
May be dominated by multiple occurrences of the same term
c) Average over top-k unique terms:
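The formulas for the three variants did not survive in this transcript, so the sketch below is only one plausible reading: (a) averages the correlation of every term occurrence in the document, (b) averages the k highest values over term occurrences, and (c) averages the k highest values over distinct terms. Using the absolute correlation, treating unknown words as 0, and the name `average_correlation` are assumptions; the correlation values are made up.

```python
def average_correlation(doc_tokens, corr, k=None, unique=False):
    """Average |correlation| of a document's terms.

    k=None            -> (a) average over all term occurrences
    k=K               -> (b) average over the top-K term occurrences
    k=K, unique=True  -> (c) average over the top-K distinct terms
    """
    tokens = set(doc_tokens) if unique else doc_tokens
    # Assumption: words without a known correlation contribute 0.
    values = sorted((abs(corr.get(w, 0.0)) for w in tokens), reverse=True)
    if k is not None:
        values = values[:k]
    return sum(values) / len(values) if values else 0.0


# Illustrative correlations and a document that repeats one word.
corr = {"taliban": 0.84, "security": 0.86, "the": 0.05, "weather": 0.10}
doc = ["taliban", "taliban", "taliban", "the", "weather", "security"]

all_terms = average_correlation(doc, corr)
top3 = average_correlation(doc, corr, k=3)
top3_unique = average_correlation(doc, corr, k=3, unique=True)
```

In this toy document the top-3 average is driven by the repeated word "taliban", which is the concern raised under (b); the unique variant counts each word once and gives a lower score.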
Evaluation
• Data Set
– New York Times corpus (Jul 2000~Dec 2001)
• Entity annotated
– Daily stock prices of 24 companies
• Measure
– Mean average precision (MAP)
– Normalized discounted cumulative gain (NDCG)
• Research questions
1. Can our method retrieve meaningful documents?
2. Does DTW outperform Pearson Correlation?
3. Which aggregation function works the best?
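For reference, the two measures can be computed as in the sketch below. It uses the standard textbook definitions; the slides do not say which NDCG variant or cutoff was used, so the log2 discount and the ideal ranking built from the same judged list are assumptions, as are the function names.

```python
import math


def average_precision(ranked_relevance, n_relevant=None):
    """Average precision of one ranked list of 0/1 relevance judgments."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    # If the total number of relevant documents is unknown, use those retrieved.
    denom = n_relevant if n_relevant is not None else hits
    return total / denom if denom else 0.0


def ndcg(ranked_gains):
    """NDCG with a log2 discount; the ideal ranking re-sorts the same gains."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal else 0.0


def mean_average_precision(runs):
    """MAP: mean of average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


# Two toy ranked lists (1 = relevant, 0 = not relevant).
runs = [[1, 0, 1, 0], [0, 1, 0, 0]]
ap_first = average_precision(runs[0])
map_score = mean_average_precision(runs)
ndcg_first = ndcg(runs[0])
```

Here each query time series (one per company) would contribute one ranked list, and MAP is the mean over those lists.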
Top ranked documents by American Airlines stock price
Rank  Date        Excerpt
1     10/22/2001  Fleeing the war
2     12/11/2001  US and anti-Taliban forces in Afghanistan
3     11/18/2001  Fate of Taliban Soldiers Under Discussion
4     11/12/2001  Tally of dead and missing in Sep 11 terrorist attacks
5     9/25/2001   Soldiers in Afghanistan …
6     11/19/2001  Recovery operation at World Trade Center
7     11/3/2001   4343 died or missing as a result of the attacks on Sep 11
8     11/17/2001  Dead and missing report of Sep 11 attack
…     …           …
All top-ranked documents are related to the September 11 terrorist attacks
Top Correlated Words to American Airlines stock price
• All terms most correlated with the input time series are related to the terrorist attacks
• Highly correlated terms contributed to the retrieval of documents about this topic
Word |ρ|
challenged 0.887031
afghanistan 0.861351
security 0.858745
sept 0.858309
terrorism 0.854865
pakistan 0.848829
aghans 0.844596
afghan 0.843481
islamic 0.842499
taliban 0.841455
Top ranked ‘relevant’ documents for Apple stock price
Rank  Date        Excerpt
1     9/29/2000   Fourth-quarter earning far below estimates
2     12/8/2000   $4 billion reserve, not $11 billion
3     10/19/2000  Announced earnings report
4     4/29/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple’s new retail stores
6     12/6/2000   Apple warns it will record quarterly loss
7     3/24/2001   Stocks perk up, with Nasdaq posting gain
8     8/10/2000   Mixing Mac and Windows
…     …           …
• Retrieved relevant events: disappointing earnings report, store opening, etc.
• Useful as a new feature for re-ranking search results?
Quantitative Evaluation
• All our methods > Random precision (0.0013)
• Dynamic time warping >> Pearson correlation
Method   MAP     NDCG
Pearson  0.0019  0.3515
DTW      0.0022  0.3609
- Average performance (average correlation as aggregation method)
Comparison of Aggregation Methods
• AC << TopK, BM25
• Top5-AC << Top20-AC, but not more than K=20
• BM25 is sensitive to parameter setting
– Scores of AC methods are more meaningful
• Incomplete judgments: possibly much better performance in reality
Method         MAP     NDCG
AC             0.0019  0.3515
Top5-AC        0.0021  0.361
Top10-AC       0.0023  0.3618
Top20-AC       0.0024  0.3629
Top5-AC-Uniq   0.0022  0.3613
Top10-AC-Uniq  0.0022  0.3616
Top20-AC-Uniq  0.0022  0.3619
Top5-BM25      0.0019  0.3584
Top10-BM25     0.0023  0.361
Top20-BM25     0.0019  0.3582
- Average performance (w/ Pearson correlation)
“Higher” NDCG vs. Low MAP
Summary
• Introduced a novel retrieval problem: time series as query
• Studied basic solutions: time series representation of terms
– Term retrieval: correlation(query, term)
– Document retrieval: aggregation of term retrieval results
• Dynamic time warping + top-K average correlation seems to work well
Limitations & Future Work
• Evaluation is based on simulation
– Highly incomplete judgments!
– What’s a good way to evaluate such a new retrieval task?
• Current solutions are heuristic
– How can we develop a more principled model?
• Different notions of relevance
– “Local” relevance vs. global relevance?
• All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)
Thank You! Comments/Questions?