USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Oct. 28, 2010

Revisions of “Topology” on Wikipedia

[Figure: 1st revision, 250th revision, and current revision]

Observable Document Generation Process

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms

95th revision 96th revision

#i, #i-1

How Revision History Analysis Could Help Retrieval

Revision History Analysis

Selected Prior Work

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.

• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.

• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In CIKM, New York, NY, USA, 2009.

Revision History Analysis (RHA)

RHA redefines term frequency (TF):
– TF is a key indicator of document relevance
– TF can be naturally integrated into ranking models

$$S(Q, D) = \sum_{t \in Q} \frac{IDF(t) \cdot TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Language Model:

$$S(Q, D) = D(\ldots)$$
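As a rough illustration of how a term frequency plugs into a ranking function, the sketch below implements the first scoring formula on this slide, which has the Okapi BM25 form. The function name `bm25_score`, the dictionary-based inputs, and the defaults `k1=1.2` and `b=0.75` are assumptions made for illustration; the slides do not state parameter values.

```python
def bm25_score(query_terms, doc_tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Score one document with the BM25-style formula S(Q, D).

    query_terms: iterable of query terms t in Q
    doc_tf:      dict term -> TF(t, D); raw counts or any re-weighted TF
    doc_len:     |D|, the document length
    avgdl:       average document length in the collection
    idf:         dict term -> IDF(t), assumed to be precomputed
    k1, b:       illustrative defaults, not taken from the slides
    """
    # Length normalization term: k1 * (1 - b + b * |D| / avgdl)
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0.0)
        if tf == 0.0:
            continue  # a term absent from the document contributes nothing
        score += idf.get(t, 0.0) * tf * (k1 + 1.0) / (tf + norm)
    return score


if __name__ == "__main__":
    # Hypothetical toy inputs, only to show the call shape.
    print(bm25_score(["topology"], {"topology": 3}, 120, 100, {"topology": 2.0}))
```

Because `doc_tf` is just a mapping from terms to numbers, a revision-aware term frequency could be passed in place of raw counts without touching the scoring code.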

Model 1: Steady growth

Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example…..basic examples include compactness and connectedness

Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.

First revision

Current version

Model 1 (continued)

RHA Global Model: definition

Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.

$$TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$$

where $c(t, v_j)$ is the frequency of term $t$ in revision $v_j$, and $j^{\alpha}$ is the decay factor.
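The global model can be sketched in a few lines. This is a minimal illustration, assuming each revision is already tokenized into a list of terms and that revisions are ordered oldest first; the function name `tf_global` and the input layout are hypothetical, and the value of `alpha` is left to the caller because the slides do not give one.

```python
def tf_global(term, revisions, alpha):
    """Decayed term frequency summed over all revisions v_1 ... v_n.

    revisions: list of token lists, oldest revision first (assumed layout)
    alpha:     decay exponent; larger values favor early revisions more
    """
    total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        # c(t, v_j) / j^alpha: the count in revision j, damped by its position
        total += tokens.count(term) / (j ** alpha)
    return total


if __name__ == "__main__":
    # Hypothetical three-revision history of a tiny document.
    history = [
        ["topology", "space"],
        ["topology", "space", "topology"],
        ["topology", "space", "topology", "homeomorphism"],
    ]
    print(tf_global("topology", history, alpha=1.0))
    print(tf_global("homeomorphism", history, alpha=1.0))
```

A term that has been present since the first revision accumulates weight from every revision, while a term introduced late only collects heavily damped contributions.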

But… Some pages are different: “Avatar (2009 film)”

[Figure: 1st revision, 500th revision, and current revision]

Model 2: Bursty Growth

Burst of Document (Length) & Change of Term Frequency

Time         Term Frequency “Pandora”    Term Frequency “James Cameron”    Document Length
Nov. 2009    9                           23                                2576
Dec. 2009    25                          50                                6306

Burst of Edit Activity & Associated Events

Month (2009)     Jul.   Aug.   Sep.   Oct.   Nov.   Dec.
Edit Activity    89     224    67     154    232    1892

Associated events: first photo & trailer released; movie released.

Global Model might be insufficient

RHA Burst Model: Definition

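• A burst resets the decay clock for a term.
• The weight will decrease after a burst.

$$TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$$

where $c(t, v_k)$ is the frequency of term $t$ in revision $v_k$, $(k - b_j + 1)^{\beta}$ is the decay factor for the $j$th burst, $b_j$ is the revision at which the $j$th burst starts, and $m$ is the number of bursts.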
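A small sketch of the burst model follows. It reads the formula as a sum over bursts, where each burst restarts the decay at its own starting revision and runs to the last revision; that reading of the summation limits, the function name `tf_burst`, and the 1-based `burst_starts` list are assumptions for illustration.

```python
def tf_burst(term, revisions, burst_starts, beta):
    """Burst-aware term frequency.

    revisions:    list of token lists, oldest first (v_1 ... v_n)
    burst_starts: 1-based revision indices b_j where each burst begins
                  (assumed to come from a separate burst detector)
    beta:         decay exponent applied after each burst
    """
    n = len(revisions)
    total = 0.0
    for b_j in burst_starts:
        for k in range(b_j, n + 1):
            # The decay clock restarts at the burst: the denominator is 1 at k = b_j
            total += revisions[k - 1].count(term) / ((k - b_j + 1) ** beta)
    return total


if __name__ == "__main__":
    # Hypothetical history where "pandora" only appears from revision 2 on.
    history = [["avatar"], ["avatar", "pandora"], ["avatar", "pandora", "pandora"]]
    print(tf_burst("pandora", history, burst_starts=[2], beta=1.0))
```

With this definition a term that first appears during a burst is weighted as if it were in an early revision, which the global model alone would not do.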

Burst Detection (1): Content-based

Relative content change → potential burst

Content-based Burst for “Avatar”
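The slides do not spell out the content-based criterion, so the following is only one plausible sketch: flag a revision as a potential burst when the document length changes by more than a chosen fraction relative to the previous revision. The function name `content_bursts`, the use of document length as the content measure, and the `threshold` parameter are all assumptions.

```python
def content_bursts(doc_lengths, threshold):
    """Return 1-based indices of revisions with a large relative length change.

    doc_lengths: document length per revision, oldest first (assumed input)
    threshold:   minimum relative change, e.g. 1.0 means "more than doubled
                 or shrank to nothing"; a hypothetical tuning parameter
    """
    bursts = []
    for j in range(1, len(doc_lengths)):
        prev, cur = doc_lengths[j - 1], doc_lengths[j]
        # Relative change between consecutive revisions
        if prev > 0 and abs(cur - prev) / prev > threshold:
            bursts.append(j + 1)
    return bursts


if __name__ == "__main__":
    # Document lengths for "Avatar" in Nov. and Dec. 2009 from the table above,
    # treated here as two consecutive snapshots rather than single revisions.
    print(content_bursts([2576, 6306], threshold=1.0))
```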

Burst Detection (2): Activity Based

Intensive edit activity → potential bursts

Activity-based Burst for “Avatar”

Average revision counts

Deviation
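One way to read the "average revision counts" and "deviation" labels is as a threshold test: a period counts as a potential burst when its number of edits exceeds the average by more than some multiple of the deviation. The sketch below follows that reading; the function name `activity_bursts`, the use of the population standard deviation, and the `num_devs` multiplier are assumptions, not details given on the slides.

```python
import statistics


def activity_bursts(edit_counts, num_devs=1.0):
    """Return 0-based indices of periods with unusually high edit activity.

    edit_counts: number of edits per period (e.g. per month), in time order
    num_devs:    how many deviations above the mean count as a burst
                 (hypothetical tuning parameter)
    """
    mean = statistics.mean(edit_counts)
    dev = statistics.pstdev(edit_counts)
    return [i for i, c in enumerate(edit_counts) if c > mean + num_devs * dev]


if __name__ == "__main__":
    # Monthly edit activity for "Avatar", Jul. to Dec. 2009, from the table above.
    months = ["Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]
    counts = [89, 224, 67, 154, 232, 1892]
    print([months[i] for i in activity_bursts(counts)])
```

On the monthly counts from the edit-activity table, only December clears a mean-plus-one-deviation threshold under these assumptions.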

Burst Detection (3): Combined Model

Putting it All Together: RHA Term Frequency
Combining global model and burst model

RHA Term Frequency:

$$TF_{rha}(t, D) = \lambda_1 \cdot TF_g(t, D) + \lambda_2 \cdot TF_b(t, D) + \lambda_3 \cdot TF(t, D)$$

$\lambda_1$, $\lambda_2$, and $\lambda_3$ indicate the weights of the RHA global model, the burst model, and the original term frequency (probability), with

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
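The combination is a plain linear interpolation, which can be sketched as below. The function name `tf_rha`, the argument order, and the tolerance used to check that the weights sum to one are illustrative assumptions; the slides do not give values for the weights.

```python
def tf_rha(tf_g, tf_b, tf, lam1, lam2, lam3):
    """Interpolate global, burst, and original term frequencies.

    tf_g, tf_b, tf:   TF_g(t, D), TF_b(t, D), and the original TF(t, D)
    lam1, lam2, lam3: interpolation weights, required to sum to 1
    """
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf


if __name__ == "__main__":
    # Hypothetical component values and weights, only to show the call shape.
    print(tf_rha(2.0, 4.0, 6.0, lam1=0.5, lam2=0.25, lam3=0.25))
```

Setting `lam3=1` recovers the original term frequency, so the unmodified retrieval model is a special case of this combination.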

Integrating RHA into Retrieval Models

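$$S(Q, D) = \sum_{t \in Q} \frac{IDF(t) \cdot TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

with $TF_{rha}(t, D)$ substituted for $TF(t, D)$.

Statistical Language Models:

$$S(Q, D) = D(\ldots)$$

with $P_{rha}(t, D)$ as the term probability.

RHA Term Probability:

$$P_{rha}(t, D) = \lambda_1 \cdot P_g(t, D) + \lambda_2 \cdot P_b(t, D) + \lambda_3 \cdot P(t, D)$$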

Experimental Setup

Datasets

INEX: well-established forum for structured retrieval tasks (based on Wikipedia collection)

TREC: performance comparison on a different set of queries and general applicability

INEX 65 topic
