Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Oct. 28, 2010
Revisions of “Topology” on Wikipedia
1st revision:
250th revision:
Current revision:
Observable Document Generation Process
In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.
In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms
95th revision (#i-1), 96th revision (#i)
How Revision History Analysis Could Help Retrieval
Revision History Analysis
Selected Prior Work
• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.
• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In Proc. of CIKM, 2009.
Revision History Analysis (RHA)
RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models

BM25:
$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Language Model:
$$S(Q, D) = D(\ldots)$$
Model 1: Steady growth
Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example…..basic examples include compactness and connectedness
Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.
First revision
Current version
Model 1 (continued)
RHA Global Model: definition
Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.

$$TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$$

$c(t, v_j)$: frequency of term $t$ in revision $v_j$
$j^{\alpha}$: decay factor
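The global model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tf_global`, the whitespace tokenizer, and the default `alpha=1.1` are hypothetical, since the slide does not give a value for the decay exponent.

```python
from typing import Sequence


def tf_global(term: str, revisions: Sequence[str], alpha: float = 1.1) -> float:
    """Decay-weighted term frequency over revisions v_1..v_n (oldest first).

    alpha is the decay exponent; its default here is an assumed value.
    """
    total = 0.0
    for j, text in enumerate(revisions, start=1):
        # c(t, v_j): raw count of the term in revision j (naive whitespace tokens)
        count = text.lower().split().count(term.lower())
        total += count / (j ** alpha)
    return total


# Toy revision history (made-up text, oldest revision first).
revisions = [
    "topology is the study of topological spaces",
    "topology studies spaces and topology studies maps",
    "topology studies continuity",
]
print(tf_global("topology", revisions, alpha=1.0))
print(tf_global("maps", revisions, alpha=1.0))
```

With `alpha=1.0`, "topology" accumulates 1/1 + 2/2 + 1/3, while "maps", which first appears in the second revision, only contributes 1/2. Terms present from the earliest revisions therefore receive more weight.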
But… Some pages are different: “Avatar (2009 film)”
1st revision:
500th revision:
Current revision:
Model 2: Bursty Growth
Burst of Document (Length) & Change of Term Frequency

| Time | Term frequency: “Pandora” | Term frequency: “James Cameron” | Document length |
|---|---|---|---|
| Nov. 2009 | 9 | 23 | 2576 |
| Dec. 2009 | 25 | 50 | 6306 |

Burst of Edit Activity & Associated Events

| Month (2009) | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|
| Edit activity | 89 | 224 | 67 | 154 | 232 | 1892 |

Associated events: first photo & trailer released; movie released
Global Model might be insufficient
RHA Burst Model: Definition
• A burst resets the decay clock for a term.
• The weight will decrease after a burst.

$$TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$$

$c(t, v_k)$: frequency of term $t$ in revision $v_k$
$(k - b_j + 1)^{\beta}$: decay factor for the $j$th burst
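The burst model can be sketched as follows. This is an illustrative reading of the formula, not the authors' code: it assumes that `burst_starts` holds the 1-based revision indices $b_1, \ldots, b_m$ at which bursts begin, that revisions are ordered oldest first, and that `beta=1.1` is an acceptable placeholder for the unspecified decay exponent. The function name and tokenizer are hypothetical.

```python
from typing import Sequence


def tf_burst(term: str, revisions: Sequence[str],
             burst_starts: Sequence[int], beta: float = 1.1) -> float:
    """Sum, over each burst j, the decayed term counts from revision b_j onward."""
    n = len(revisions)
    total = 0.0
    for b_j in burst_starts:              # 1-based index where burst j starts
        for k in range(b_j, n + 1):       # revisions b_j..n
            count = revisions[k - 1].lower().split().count(term.lower())
            # The decay clock restarts at 1 for the first revision of the burst.
            total += count / ((k - b_j + 1) ** beta)
    return total


# Toy history (made-up text); a burst is assumed to start at revision 3.
revisions = [
    "avatar film",
    "avatar pandora",
    "pandora pandora avatar",
    "pandora",
]
print(tf_burst("pandora", revisions, burst_starts=[3], beta=1.0))
```

Here the two occurrences in revision 3 are counted at full weight because the burst resets the decay, and the occurrence in revision 4 is already discounted by 1/2.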
Burst Detection (1): Content-based
Relative content change → potential burst
Content-based Burst for “Avatar”
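One possible way to operationalize content-based detection is shown below. The slide only states that a relative content change signals a potential burst, so everything specific here is an assumption: content change is measured by document length, the threshold value is arbitrary, and the function name `content_bursts` is hypothetical.

```python
from typing import List, Sequence


def content_bursts(lengths: Sequence[int], threshold: float = 0.5) -> List[int]:
    """Return 0-based indices of snapshots whose length changed sharply.

    The relative change is |len_j - len_{j-1}| / len_{j-1}; the threshold is
    an assumed cut-off, not a value from the slides.
    """
    bursts = []
    for j in range(1, len(lengths)):
        prev = lengths[j - 1]
        if prev == 0:
            continue  # relative change is undefined for an empty predecessor
        change = abs(lengths[j] - prev) / prev
        if change > threshold:
            bursts.append(j)
    return bursts


# 2576 and 6306 are the Nov./Dec. 2009 lengths reported for "Avatar";
# the leading 2550 is a made-up earlier snapshot for illustration.
print(content_bursts([2550, 2576, 6306]))
```

The jump from 2576 to 6306 is a relative change of about 1.45, so the last snapshot is flagged, while the small change before it is not.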
Burst Detection (2): Activity Based
Intensive edit activity → potential bursts
Activity-based Burst for “Avatar”
Average revision counts
Deviation
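Activity-based detection can be sketched with the average and deviation of edit counts. The exact rule is not recoverable from the slide, so the threshold below (mean plus `k` population standard deviations) and the name `activity_bursts` are assumptions. The monthly counts are the edit-activity figures reported for "Avatar" earlier in the deck.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def activity_bursts(edit_counts: Sequence[int], k: float = 1.0) -> List[int]:
    """Return 0-based indices of periods with unusually intensive editing.

    A period is flagged when its count exceeds mean + k * deviation;
    k is an assumed sensitivity parameter.
    """
    mu = mean(edit_counts)
    sigma = pstdev(edit_counts)
    return [i for i, count in enumerate(edit_counts) if count > mu + k * sigma]


# Edits per month, Jul. to Dec. 2009.
monthly_edits = [89, 224, 67, 154, 232, 1892]
print(activity_bursts(monthly_edits))  # [5], i.e. December
```

For these counts the mean is 443 and the deviation is roughly 651, so only December, the month of the movie release, exceeds the threshold.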
Burst Detection (3): Combined Model
Putting it All Together: RHA Term Frequency
Combining global model and burst model

RHA Term Frequency:
$$TF_{rha}(t, D) = \lambda_1 \cdot TF_g(t, D) + \lambda_2 \cdot TF_b(t, D) + \lambda_3 \cdot TF(t, D)$$

$\lambda_1$, $\lambda_2$, and $\lambda_3$ indicate the weights of the RHA global model, burst model, and original term frequency (probability), with
$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
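The interpolation is a plain convex combination, sketched below. The function name `tf_rha` and the weight values in the example are hypothetical; the tuned weights are not recoverable from the slides.

```python
def tf_rha(tf_g: float, tf_b: float, tf: float,
           lam1: float, lam2: float, lam3: float) -> float:
    """Interpolate global, burst, and original term frequencies.

    The weights must sum to 1, as required by the constraint on the slide.
    """
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("lam1 + lam2 + lam3 must equal 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf


# Made-up component values and weights, for illustration only.
print(tf_rha(tf_g=2.0, tf_b=4.0, tf=6.0, lam1=0.25, lam2=0.25, lam3=0.5))  # 4.5
```

Setting `lam3=1` recovers the original term frequency, so the baseline model is a special case of the combined one.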
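Integrating RHA into Retrieval Models

BM25:
$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$
BM25 + RHA: $TF(t, D)$ is replaced by $TF_{rha}(t, D)$ in both the numerator and the denominator.

Statistical Language Models:
$$S(Q, D) = D(\ldots)$$
LM + RHA: the term probability is replaced by $P_{rha}(t, D)$.

RHA Term Probability:
$$P_{rha}(t, D) = \lambda_1 \cdot P_g(t, D) + \lambda_2 \cdot P_b(t, D) + \lambda_3 \cdot P(t, D)$$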
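The BM25 substitution can be illustrated with a scorer that takes the term-frequency function as a parameter, so that either the original TF or an RHA-style TF can be plugged in. This is a sketch under assumptions: `bm25_score` is a hypothetical name, the IDF values are supplied by the caller rather than computed, and `k1=1.2`, `b=0.75` are commonly used defaults, not the tuned values from the experiments.

```python
from typing import Callable, Iterable, Mapping


def bm25_score(query_terms: Iterable[str],
               tf: Callable[[str], float],
               idf: Mapping[str, float],
               doc_len: float, avgdl: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of one document; tf(t) may be plain TF or an RHA variant."""
    # Length normalization term: k1 * (1 - b + b * |D| / avgdl)
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query_terms:
        f = tf(t)
        score += idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm)
    return score


# Made-up frequencies: the RHA-style value stands in for TF_rha(t, D).
plain_tf = {"pandora": 2.0}
rha_tf = {"pandora": 3.0}
idf = {"pandora": 1.0}

baseline = bm25_score(["pandora"], lambda t: plain_tf.get(t, 0.0), idf, 100, 100)
with_rha = bm25_score(["pandora"], lambda t: rha_tf.get(t, 0.0), idf, 100, 100)
print(baseline, with_rha)
```

Passing the frequency as a callable keeps the ranking function unchanged, which mirrors the point of the slide: only the TF component is swapped.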
Experimental Setup
Datasets
INEX: well-established forum for structured retrieval tasks (based on Wikipedia collection)
TREC: performance comparison on a different set of queries and general applicability

INEX: 65 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for INEX
TREC: 68 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for TREC
WikiDump
Results
INEX Results
| Model | bpref | MAP | R-precision |
|---|---|---|---|
| BM25 | 0.354 | 0.354 | 0.314 |
| BM25+RHA | 0.375 (+5.93%) | 0.360 (+1.69%) | 0.337 (+7.32%) |
| LM | 0.357 | 0.370 | 0.348 |
| LM+RHA | 0.372 (+4.20%) | 0.378 (+2.16%) | 0.359 (+3.16%) |

Parameters tuned on INEX query set
BM25: , LM: ,
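The percentages in the table are relative improvements over the corresponding baseline. A small helper makes the arithmetic explicit; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# bpref for BM25 vs. BM25+RHA on INEX, taken from the table.
print(f"{relative_improvement(0.354, 0.375):+.2f}%")  # +5.93%
```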
TREC Results
| Model | bpref | MAP | NDCG |
|---|---|---|---|
| BM25 | 0.524 | 0.548 | 0.634 |
| BM25+RHA | 0.547** (+4.39%) | 0.568** (+3.65%) | 0.656** (+3.47%) |
| LM | 0.527 | 0.556 | 0.645 |
| LM+RHA | 0.532 (+0.95%) | 0.567 (+1.98%) | 0.653 (+1.24%) |

Parameters tuned on INEX query set; ** indicates statistically significant differences at the 0.01 significance level with a two-tailed paired t-test
Lab members manually labeled top 20 results for each topic
BM25: , LM: ,
Performance Analysis
Performance improvements on bpref for BM25+RHA over baseline (BM25)
INEX: significant improvement on 40% of queries
TREC: significant improvement on 37% of queries
Ex: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)
INEX TREC
Summary
o RHA captures importance signal from document authoring process.
o Introduced RHA term weighting approach
o Natural integration with state-of-the-art retrieval models.
o Consistent improvement over baseline retrieval models
Thank you!
Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Research partially supported by:
Query Sets and Evaluation Metrics
• Queries and Labels:
  – INEX: provided
  – TREC: subset of ad-hoc track
• Metrics:
  – Bpref (robust to missing judgments)
  – MAP: mean average precision
  – R-prec: precision at position R
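Two of the metrics listed above can be sketched directly from their standard definitions, where R is the number of relevant documents for the query. This is an illustration, not the evaluation tooling used in the experiments; the function names and the toy ranking are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision within the top R results, with R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked[:r] if doc_id in relevant) / r


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: mean of average precision over (ranking, relevant set) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy ranking with made-up document ids.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant), r_precision(ranked, relevant))
```

For this ranking, average precision is (1/1 + 2/3) / 2 and R-precision is 1/2, since only one of the top two results is relevant.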
RHA in Statistical Language Models
o (Global Model)
o (Burst Model)
Cross validation on INEX
| Model | bpref | MAP | R-precision |
|---|---|---|---|
| BM25 | 0.307 | 0.281 | 0.324 |
| BM25+RHA | 0.312 (+1.63%) | 0.291 (+3.56%) | 0.320 (-1.23%) |
| LM | 0.311 | 0.284 | 0.348 |
| LM+RHA | 0.338 (+8.68%) | 0.298 (+4.93%) | 0.359 (+0.61%) |

5-fold cross validation on INEX 2008 query set

| Model | bpref | MAP | R-precision |
|---|---|---|---|
| BM25 | 0.354 | 0.354 | 0.314 |
| BM25+RHA | 0.363 (+2.54%) | 0.348 (-1.70%) | 0.333 (+6.05%) |
| LM | 0.357 | 0.370 | 0.348 |
| LM+RHA | 0.366 (+2.52%) | 0.375 (+1.35%) | 0.352 (+1.15%) |

5-fold cross validation on INEX 2009 query set