Post on 18-Dec-2015
Empirical Development of an Exponential Probabilistic Model
Using Textual Analysis to Build a Better Model
Jaime Teevan & David R. Karger, CSAIL (LCS+AI), MIT
Goal: Better Generative Model
Generative vs. discriminative models
Applies to many applications:
  Information retrieval (IR)
  Relevance feedback
  Using unlabeled data
  Classification
Assumptions are explicit
Using a Model for IR
1. Define model (hyper-learn it!)
2. Learn parameters from query
3. Rank documents
• Better model improves applications
  Trickles down to improve retrieval, classification, relevance feedback, …
• Corpus-specific models
Overview
Related work
Probabilistic models
  Example: Poisson Model
  Compare model to text
Hyper-learning the model
  Exponential framework
  Investigate retrieval performance
Conclusion and future work
Related Work
Using text for retrieval algorithm [Jones, 1972], [Greiff, 1998]
Using text to model text [Church & Gale, 1995], [Katz, 1996]
Learning model parameters [Zhai & Lafferty, 2002]
Hyper-learn the model from text!
Probabilistic Models

Rank documents by RV = Pr(rel|d)

Naïve Bayesian models:

RV = Pr(rel|d) ∝ Pr(d|rel) = ∏_{features t} Pr(dt|rel)

Open assumptions (these define the model!):
  Feature definition (e.g., words; dt = # occurrences in doc)
  Feature distribution family
Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Poisson Model: Pr(dt|rel) = θ^dt e^{-θ} / dt!

θ: specifies the term's distribution
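As a sketch of the model above (function and variable names are ours), the Poisson term probability can be computed directly:

```python
import math

def poisson_pr(dt, theta):
    """Poisson model: Pr(dt|rel) = theta^dt * e^(-theta) / dt!"""
    return theta ** dt * math.exp(-theta) / math.factorial(dt)

# For a rare term (theta = 0.0006), each extra occurrence is
# modeled as astronomically less likely than the last.
probs = [poisson_pr(dt, 0.0006) for dt in range(6)]
```

Summing `poisson_pr` over all dt gives 1, as a distribution should.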
Example Poisson Distribution

[Figure: Pr(dt|rel) on a log scale (1E-19 to 0.1) vs. the number of times a term occurs in a document (dt = 0 … 5), for θ = 0.0006. The probability plummets with each extra occurrence, reaching ≈1E-15.]
Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Learn a θ for each term
  Maximum-likelihood θ: the term's average number of occurrences
  Incorporate prior expectations
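A minimal sketch of this learning step for the Poisson model; the pseudo-count form of the prior is our illustration, not necessarily the authors' exact prior:

```python
def learn_theta(occurrences, prior_mean=0.01, prior_strength=1.0):
    """Estimate a term's theta from its occurrence counts in labeled docs.

    The maximum-likelihood theta is the average number of occurrences;
    a conjugate (Gamma-style) prior acts like prior_strength extra
    documents whose average count is prior_mean.
    """
    total = sum(occurrences) + prior_strength * prior_mean
    n = len(occurrences) + prior_strength
    return total / n

# e.g., a term seen 0, 1, 0, 0, 2 times in five labeled documents
theta = learn_theta([0, 1, 0, 0, 2])
```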
Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

For each document, find RV = ∏_{words t} Pr(dt|rel)
Sort documents by RV

Which step goes wrong?
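Step 3 can be sketched as follows; computing the product in log space to avoid underflow is our choice, with the Poisson form defined earlier:

```python
import math

def log_rv(doc_counts, thetas):
    """log RV = sum over words t of log Pr(dt|rel), Poisson model."""
    total = 0.0
    for t, theta in thetas.items():
        dt = doc_counts.get(t, 0)  # occurrences of term t in the doc
        total += dt * math.log(theta) - theta - math.log(math.factorial(dt))
    return total

def rank(docs, thetas):
    """Sort documents (term -> count dicts) by relevance value."""
    return sorted(docs, key=lambda d: log_rv(d, thetas), reverse=True)
```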
How Good is the Model?

[Figure: Observed data vs. the Poisson model (θ = 0.0006, "15 times"). Log-scale Pr(dt|rel) (1E-19 to 0.1) vs. dt = 0 … 5. The data sits orders of magnitude above the Poisson curve at higher occurrence counts: a misfit!]
Hyper-learning a Better Fit Through Textual Analysis
Hyper-Learning Framework

Need a framework for hyper-learning
Goal: the same benefits as the Poisson Model
  One parameter
  Easy to work with (e.g., priors)

One-parameter exponential families (cover Bernoulli, Poisson, Normal, mixtures, …)
Exponential Framework

Well understood, learning is easy [Bernardo & Smith, 1994], [Gous, 1998]

Pr(dt|rel) = f(dt) g(θ) e^{θ h(dt)}

Functions f(dt) and h(dt) specify the family
  E.g., Poisson: f(dt) = (dt!)^-1, h(dt) = dt
Parameter θ specifies the term's distribution
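A sketch of this framework; computing the normalizer g(θ) numerically over a truncated support is our simplification:

```python
import math

def exp_family_pr(dt, theta, f, h, support=range(50)):
    """Pr(dt|rel) = f(dt) g(theta) e^(theta h(dt)), where g(theta)
    normalizes the distribution (here over a truncated support)."""
    g = 1.0 / sum(f(k) * math.exp(theta * h(k)) for k in support)
    return f(dt) * g * math.exp(theta * h(dt))

# Poisson as an instance: f(dt) = 1/dt!, h(dt) = dt.
# With theta = log(lambda) this recovers Pr(dt) = lambda^dt e^-lambda / dt!
f_poisson = lambda dt: 1.0 / math.factorial(dt)
h_poisson = lambda dt: dt
```

With `theta = math.log(0.5)`, for example, this matches the Poisson probabilities for λ = 0.5 up to truncation error.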
Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Want the "best" f(dt) and h(dt)
Iterative hill climbing
  Finds a local maximum
  Poisson starting point
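The hyper-learning step might look something like the following toy sketch. It is heavily simplified (one term, θ absorbed into h, f fixed to the Poisson's 1/dt!), and the perturbation scheme is ours, not the authors':

```python
import math
import random

def fit_h(observed, support=range(6), iters=2000, step=0.1, seed=0):
    """Greedy hill climbing for h(dt), starting from the Poisson h(dt) = dt.

    observed: list of term occurrence counts seen in relevant documents.
    Each iteration perturbs one h entry and keeps the change only if the
    data log-likelihood improves (so we reach a local maximum).
    """
    rng = random.Random(seed)
    f = [1.0 / math.factorial(k) for k in support]
    h = [float(k) for k in support]  # Poisson starting point

    def loglik(h):
        z = sum(fk * math.exp(hk) for fk, hk in zip(f, h))  # normalizer
        return sum(math.log(f[d]) + h[d] - math.log(z) for d in observed)

    best = loglik(h)
    for _ in range(iters):
        k = rng.randrange(len(h))
        delta = rng.choice([-step, step])
        h[k] += delta
        trial = loglik(h)
        if trial > best:
            best = trial      # keep the improving move
        else:
            h[k] -= delta     # revert
    return h
```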
Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Data: TREC query result sets
  Past queries used to learn about future queries
  Hyper-learn and test with different sets
Recall the Poisson Distribution

[Figure: Data vs. Poisson vs. New Model ("15 times"). Log-scale Pr(dt|rel) (1E-19 to 0.1) vs. dt = 0 … 5.]
Poisson Starting Point: h(dt)

Pr(dt|rel) = f(dt) g(θ) e^{θ h(dt)}

[Figure: h(dt) (from -2 to 6) vs. dt = 0 … 5, comparing the Poisson h(dt) = dt with the learned h(dt).]
Hyper-learned Model: h(dt)

Pr(dt|rel) = f(dt) g(θ) e^{θ h(dt)}

[Figure: The same plot of h(dt) vs. dt; the hyper-learned h(dt) grows more slowly than the Poisson's linear h(dt) = dt.]
Hyper-learned Distribution

[Figure: Data vs. Poisson vs. New Model ("15 times"). Log-scale Pr(dt|rel) vs. dt = 0 … 5; the hyper-learned distribution tracks the data much more closely than the Poisson.]
[Figures: The same data / Poisson / new-model comparison for terms annotated "5 times", "30 times", and "300 times"; the hyper-learned distribution fits the data in each case.]
Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Pr(dt|rel) = f(dt) g(θ) e^{θ h(dt)}

Learn θ for each term from labeled docs
Learning θ

Sufficient statistics summarize all the observed data:
  τ1: # of observations
  τ2: Σ_{observations d} h(dt)
Map (τ1, τ2) → θ
Incorporating a prior is easy

(Experiments use 20 labeled documents)
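For a one-parameter exponential family, the maximum-likelihood θ makes the model's expected h match the observed average τ2/τ1. A sketch of the mapping (the bisection bracket and truncated support are our assumptions):

```python
import math

def theta_from_stats(tau1, tau2, f, h, support=range(50)):
    """Map sufficient statistics (tau1 = # of observations,
    tau2 = sum of h(dt) over observations) to theta by solving
    E_theta[h(dt)] = tau2 / tau1 with bisection."""
    target = tau2 / tau1

    def expected_h(theta):
        w = [f(k) * math.exp(theta * h(k)) for k in support]
        return sum(wk * h(k) for wk, k in zip(w, support)) / sum(w)

    lo, hi = -20.0, 5.0  # assumed bracket for theta
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if expected_h(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For the Poisson instance (f(dt) = 1/dt!, h(dt) = dt, θ = log λ), E_θ[h] = e^θ, so θ recovers the log of the average occurrence count.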
Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Results: Labeled Documents

[Figure: Precision vs. recall (precision 0 to 0.8) for the Poisson model and the new model.]
Retrieval: Query

Short query
Query = a single labeled document
Vector-space-like equation:

RV = Σ_{t in doc} a(t, d) + Σ_{q in query} b(q, d)

Problem: the document portion dominates
Solution: use only the query portion
Another solution: normalize
Retrieval: Query

[Figure: Precision vs. recall (precision 0 to 0.6) comparing the Poisson model, the new model, and TF.IDF.]
Conclusion

Probabilistic models
  Example: Poisson Model (easy to work with, but a bad text model)
Hyper-learning the model
  Exponential framework
  Learned a better model (heavy tailed!)
  Investigated retrieval performance

Future Work

Use the model better
Use it for other applications
  Other IR applications
  Classification
Correct for document length
Hyper-learn on different corpora
  Test whether the learned model generalizes
  Different for genre? Language? People?
Hyper-learn the model better
Questions?
Contact us with questions:
Jaime Teevan: teevan@ai.mit.edu
David Karger: karger@theory.lcs.mit.edu