Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

Post on 12-Jan-2016


Robust Pseudo Feedback & HMM Passage Extraction

UIUC at TREC 2006 Genomics Track

Jing Jiang, Xin He, ChengXiang Zhai
University of Illinois at Urbana-Champaign

11/16/06 2

Goal of Participation

• To test the effectiveness of some recent language modeling methods for genomics retrieval
  – Robust pseudo feedback [Tao & Zhai 06]
  – HMM passage extraction [Jiang & Zhai 06]
• Tasks at the 2006 Genomics Track
  – Document-level retrieval
  – Passage-level retrieval
  – Aspect-level retrieval

Overall Approach

[Diagram] Q → Document Retrieval Module (over paragraphs of Medline articles) → ranked paragraphs (1, 2, …, k) → Passage Extraction Module → ranked passages (1, 2, …, k)

Feedback: user relevance feedback or pseudo relevance feedback


KL-Divergence Retrieval Model [Lafferty & Zhai 01]

Topic model:    role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Document model: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …

Documents D1, D2, …, Dk (excerpts):
– The…for… spongiform…PrP protein…
– Prion diseases… that…(PrP C)…This…
– …which…(PrP C)…to the…prion protein…

Documents are ranked by the KL divergence between the topic model and each document model.
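A minimal sketch of KL-divergence scoring, using the topic and document models from the slide. The smoothing scheme (Jelinek-Mercer with weight `lam`), the `background` probabilities, and the second document `D2` are assumptions made for illustration; the slides do not specify them. Up to a term that depends only on the topic, the negative KL divergence equals the cross entropy computed here, so the ranking is the same.

```python
import math

def kl_score(topic_model, doc_model, background, lam=0.5, floor=1e-6):
    """Rank-equivalent negative KL divergence: sum_w p(w|topic) * log p(w|doc)."""
    score = 0.0
    for word, p_topic in topic_model.items():
        # Assumed Jelinek-Mercer smoothing so unseen words keep nonzero probability.
        p_doc = (1 - lam) * doc_model.get(word, 0.0) + lam * background.get(word, floor)
        score += p_topic * math.log(p_doc)
    return score

# Topic and D1 models are taken from the slide.
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
docs = {
    "D1": {"the": 0.020, "for": 0.015, "prp": 0.102,
           "mad": 0.034, "cow": 0.034, "diseas": 0.068},
    "D2": {"the": 0.05, "for": 0.03},  # hypothetical off-topic document
}
# Hypothetical background probabilities for the topic words.
background = {"role": 1e-3, "prnp": 1e-5, "mad": 1e-4, "cow": 1e-4, "diseas": 1e-3}

ranking = sorted(docs, key=lambda name: kl_score(topic, docs[name], background),
                 reverse=True)
print(ranking)
```

The document whose model puts more mass on the topic words ranks first.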

Model-Based Feedback [Zhai & Lafferty 01]

Topic model:      role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004

Documents D1, D2, …, Dk (excerpts):
– The…for… spongiform…PrP protein…
– Prion diseases… that…(PrP C)…This…
– …which…(PrP C)…to the…prion protein…

Feedback model (unknown, estimated with the EM algorithm):
the 0.003, for 0.002, …, prp 0.02, prion 0.05

2 parameters: α and λ
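A rough sketch of how a feedback model can be estimated with EM, assuming the usual two-component mixture in which each word of the feedback documents comes either from the unknown feedback model or from the background model. The names `background_weight` and `interp_weight` are illustrative; the slides name two parameters, α and λ, but do not say which plays which role, so no mapping is asserted here. The toy feedback documents are invented; the background values are the ones shown on the slide.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, background_weight=0.5,
                            iterations=30, floor=1e-6):
    """EM for a mixture of an unknown feedback model and a fixed background model."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the feedback model.
        from_feedback = {}
        for w in vocab:
            topic_part = (1 - background_weight) * theta[w]
            noise_part = background_weight * background.get(w, floor)
            from_feedback[w] = topic_part / (topic_part + noise_part)
        # M-step: renormalize the expected counts.
        expected = {w: counts[w] * from_feedback[w] for w in vocab}
        total = sum(expected.values())
        theta = {w: expected[w] / total for w in vocab}
    return theta

def interpolate(topic_model, feedback_model, interp_weight=0.5):
    """Mix the original topic model with the estimated feedback model."""
    words = set(topic_model) | set(feedback_model)
    return {w: (1 - interp_weight) * topic_model.get(w, 0.0)
               + interp_weight * feedback_model.get(w, 0.0) for w in words}

topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}  # from the slide
feedback_docs = [["the", "prion", "prp", "the"],   # invented toy documents
                 ["prion", "the", "for", "prp"]]

theta = estimate_feedback_model(feedback_docs, background)
expanded = interpolate(topic, theta)
```

Compared with raw relative frequencies, the estimate shifts probability away from words the background already explains well ("the", "for") and toward "prp" and "prion".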

Regularized Estimation [Tao & Zhai 06]

Topic model:      role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004

Documents D1, D2, …, Dk (excerpts):
– The…for… spongiform…PrP protein…
– Prion diseases… that…(PrP C)…This…
– …which…(PrP C)…to the…prion protein…

Feedback model (unknown, estimated with the regularized EM algorithm, using a prior):
the 0.003, for 0.002, …, prp 0.02, prion 0.05

1 parameter: η
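One generic way to bring a prior into EM is MAP estimation with Dirichlet-style pseudo-counts. The sketch below assumes the prior is centered on the topic model and is only an illustration of that general idea; it is not claimed to be the regularized EM algorithm of [Tao & Zhai 06]. The parameter `prior_strength` is hypothetical and is not the η of the slide, and the toy documents and the two-word topic model are invented.

```python
from collections import Counter

def map_em_feedback(feedback_docs, topic_model, background, background_weight=0.5,
                    prior_strength=5.0, iterations=30, floor=1e-6):
    """EM with pseudo-counts prior_strength * p(w|topic) added in the M-step."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = sorted(set(counts) | set(topic_model))
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iterations):
        # E-step: expected number of occurrences of w generated by the feedback model.
        expected = {}
        for w in vocab:
            topic_part = (1 - background_weight) * theta[w]
            noise_part = background_weight * background.get(w, floor)
            expected[w] = counts[w] * topic_part / (topic_part + noise_part)
        # M-step: add the prior pseudo-counts, then normalize.
        total = sum(expected.values()) + prior_strength
        theta = {w: (expected[w] + prior_strength * topic_model.get(w, 0.0)) / total
                 for w in vocab}
    return theta

topic = {"prp": 0.5, "prion": 0.5}                                     # invented
background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}  # from the slide
feedback_docs = [["the", "prion", "prp", "the"],                       # invented
                 ["prion", "the", "for", "prp"]]

weak = map_em_feedback(feedback_docs, topic, background, prior_strength=0.1)
strong = map_em_feedback(feedback_docs, topic, background, prior_strength=1000.0)
```

A strong prior keeps the estimate close to the topic model, while a weak prior lets the feedback documents dominate.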

Original vs. Regularized EM

[Diagram] Both variants are estimated from documents D1, D2, …, Dk.
– Original: α manually set
– Regularized: α dynamically set


HMM Passage Extraction [Jiang & Zhai 06]

HMM with states B1 → R → B2

Output probabilities:
p(w|B1): the 0.02, for 0.01, prp 0.001, …
p(w|R):  the 0.003, for 0.002, prp 0.02, …
p(w|B2): the 0.02, for 0.01, prp 0.001, …

Transition probabilities:
p(B1|B1) = 0.9, p(R|B1) = 0.1
p(R|R) = 0.95, p(B2|R) = 0.05
p(B2|B2) = 1

Paragraph: w w … w w w … w w … w
States:    B B … B R R … R B … B
The words labeled R form the relevant passage.

Extended architecture with states B1, B2, R, B3 and E:
• a background state for smoothing
• an end-of-paragraph state
• transition probabilities estimated from observations
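The three-state HMM can be decoded with the Viterbi algorithm. The sketch below uses the transition and output probabilities shown on the slide. The value `OTHER` for words elided on the slide, the requirement that the path starts in B1, and the toy paragraph are assumptions for illustration.

```python
import math

STATES = ["B1", "R", "B2"]

# Transition probabilities from the slide; missing entries are impossible moves.
TRANS = {
    "B1": {"B1": 0.9, "R": 0.1},
    "R": {"R": 0.95, "B2": 0.05},
    "B2": {"B2": 1.0},
}

# Output probabilities from the slide.
EMIT = {
    "B1": {"the": 0.02, "for": 0.01, "prp": 0.001},
    "R": {"the": 0.003, "for": 0.002, "prp": 0.02},
    "B2": {"the": 0.02, "for": 0.01, "prp": 0.001},
}
OTHER = 1e-4  # assumed probability for any word not listed above

def log_emit(state, word):
    return math.log(EMIT[state].get(word, OTHER))

def viterbi(words):
    """Most likely state sequence for `words`, assuming the path starts in B1."""
    neg_inf = float("-inf")
    best = {s: neg_inf for s in STATES}
    best["B1"] = log_emit("B1", words[0])
    back = []  # one dict of back-pointers per word after the first
    for word in words[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            candidates = [(best[p] + math.log(TRANS[p][s]), p)
                          for p in STATES if s in TRANS[p] and best[p] > neg_inf]
            if candidates:
                score, prev = max(candidates)
                new_best[s], pointers[s] = score + log_emit(s, word), prev
            else:
                new_best[s], pointers[s] = neg_inf, None
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

words = ["the", "for", "the", "prp", "prp", "prp", "the", "for", "the"]
path = viterbi(words)
passage = [i for i, s in enumerate(path) if s == "R"]
print(path, passage)
```

On this toy paragraph the three "prp" tokens are labeled R, and the words before and after them fall into B1 and B2.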

Experiment Design

• Pre-processing
  – HTML parsing
  – Paragraph boundaries
  – Tokenization
• User relevance feedback

Official Runs

[Diagram] Q → KL-Div Retrieval (over paragraphs of Medline articles) → ranked paragraphs (1, 2, …, k) → HMM Passage Extraction → ranked passages (1, 2, …, k), with feedback producing an updated query Q'

• UIUCauto: regularized estimation (listed under pseudo feedback in the results tables)
• UIUCinter: regularized estimation (listed under user feedback in the results tables)
• UIUCinter2: original estimation

Pseudo Relevance Feedback (k = 10)

Method                         Doc MAP              Rel. Impr.
Baseline (no feedback)         0.3484               N/A
Original Estimation, Def       0.3606               +3.50%
Original Estimation, Opt       0.3943               +13.2%
Regularized Estimation, Def    0.3842 (UIUCauto)    +10.3%
Regularized Estimation, Opt    0.3952               +13.4%

η is similar to λ / (1 − λ)
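The relative improvements in the pseudo feedback table can be reproduced from the Doc MAP values. A small sketch, with the helper name `relative_improvement` chosen here for illustration:

```python
def relative_improvement(baseline, score):
    """Relative change of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

baseline_map = 0.3484  # Doc MAP without feedback, from the table
runs = {
    "Original, Def": 0.3606,
    "Original, Opt": 0.3943,
    "Regularized, Def (UIUCauto)": 0.3842,
    "Regularized, Opt": 0.3952,
}

improvements = {name: relative_improvement(baseline_map, doc_map)
                for name, doc_map in runs.items()}
for name, gain in improvements.items():
    print(f"{name}: {gain:+.1f}%")
```

The same formula, applied with the pseudo feedback score or the paragraph score as the baseline, gives the Rel. Impr. columns of the user feedback and passage extraction tables.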

Parameter Sensitivity (pseudo feedback, k = 10)

[Figure: parameter sensitivity plot]

User Relevance Feedback

                               Doc MAP              Doc MAP
Method                         (Pseudo Feedback)    (User Feedback)       Rel. Impr.
Original Estimation, Def       0.3606               0.3986                +10.5%
Original Estimation, Opt       0.3943               0.4511                +14.4%
Regularized Estimation, Def    0.3842 (UIUCauto)    0.4261 (UIUCinter)    +10.9%
Regularized Estimation, Opt    0.3952               0.4515                +14.2%

HMM Passage Extraction

              Psg MAP        Psg MAP
Run           (Paragraph)    (HMM Passage)    Rel. Impr.
UIUCauto      0.03753        0.04864          +29.6%
UIUCinter     0.04481        0.05906          +31.8%
UIUCinter2    0.04580        0.06038          +31.8%

Passage Length (in Bytes)

                 Max     Min    Avg       Std
True Passages    6928    27     399.8     489.4
HMM Passages     6955    34     1525.8    949.7
Paragraph        8670    60     2105.4    1136.8

HMM passages are generally too long!

Example Passage

Prion diseases, which include Creutzfeldt-Jakob disease in humans, mad cow disease in cattle, and scrapie in sheep, involve the misfolding of the benign cellular prion protein (PrP C) to the infectious disease-causing scrapie isoform PrP Sc. The prion protein (PrP C) is a copper-binding cell surface glycoprotein. The role of copper in the normal function of PrP, as well as in prion diseases, has been the subject of a number of excellent reviews. The mature cellular form of PrP consists of residues 23 to 231 and is tethered to the cell surface via a glycosylphosphatidylinositol anchor at the C terminus. There are now a number of NMR solution structures of copper-free mammalian PrPs. A crystal structure of PrP C has also been published; this structure is dimeric, involving domain swapping of the monomeric form.

Conclusions and Future Work

• The two language modeling methods in general work well in the genomics domain
  – Regularized feedback estimation can effectively eliminate parameter α
  – HMM passages improve over paragraphs
• User relevance feedback is effective
• Limitations and future work
  – Regularized feedback estimation still has parameter η to tune
    • How to eliminate η?
  – The inherent coherence property of HMM passages may not suit the task well
    • Different/better HMM architecture?

The End

• Questions?