Effective Topic Distillation with Key Resource Pre-selection

Yiqun Liu, Min Zhang and Shaoping Ma

State Key Lab of Intelligent Tech. & Sys. Tsinghua University, Beijing, 100084

[email protected]

(2004/10/19)

For AIRS presentation 04/10/19

Outline

• Why Key Resource Pre-selection?

• Possibilities of selecting key resources

• How to select key resources?

• Experiments

• Conclusion

Why Key resource selection? (1)

• The amount of web pages

Medium                  2002 Internet
Surface Web             167 TB
Deep Web                91,850 TB
#Surface web pages      20 billion
#Deep web pages         130 billion

According to "How Much Information", 2003. http://www.sims.berkeley.edu/how-much-info-2003.

Why Key resource selection? (2)

• Index size of web search engines

[Figure: Billions of Textual Documents Indexed, for GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista]

According to a report by the Search Engine Watch website, September 2, 2003

Less than 1/6

Why Key resource selection? (3)

• Not all pages can be indexed by web IR tools

• Many indexed pages aren't key resources

• So TD is difficult, which motivates key resource selection

Definitions of TD and key resource

• Key Resource (Key Resource Page)
  – High-quality web pages for a particular topic
    • Offering credible information/service for this topic
    • Introducing other useful web pages for this topic
  – Key resources are only a small part of relevant pages

• Topic Distillation (TD)
  – To find key resources for certain topics
  – A major task for web search (it covers over 70% of web search queries)

Outline

• Selecting key resources is useful for TD

• Possibilities of selecting key resources
  – Is there any difference between ordinary pages and key resource pages?

• How to select key resources?

• Experiments

• Conclusion

Non-content features of key resources

• Key resources vs. ordinary pages (non-content features)
  – Commonly used features
    • In-degree, URL-type, Doc-length
  – Features involving a site's self-link analysis
    • In-site out-link number, anchor text rate

• Two data sets to compare the differences
  – Key resource page training set
    • Built with TREC 11 TD task relevant qrels
  – Ordinary page set: .GOV (over 1.2M web pages from the .GOV domain)

In-degree

• Key resource pages have more in-links
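
In-degree here is the number of distinct pages linking to a page. A minimal sketch of how it could be counted from a link list follows; the function name `in_degrees`, the choice to ignore self-links and duplicate links, and the example URLs are all assumptions for illustration, not details given in the slides.

```python
from collections import Counter


def in_degrees(links):
    """Count in-links per target page from (source_url, target_url) pairs.

    Assumptions: duplicate (source, target) pairs count once and
    self-links are ignored.
    """
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_links)


# Hypothetical link list.
links = [
    ("a.gov/", "b.gov/"),
    ("c.gov/", "b.gov/"),
    ("c.gov/", "b.gov/"),  # duplicate, counted once
    ("b.gov/", "b.gov/"),  # self-link, ignored
    ("a.gov/", "c.gov/"),
]
degrees = in_degrees(links)
```

Pages that never appear as a link target simply report an in-degree of 0, since `Counter` returns 0 for missing keys.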

URL-type

• Key resource pages tend to be non-FILE type
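
The slides do not spell out the URL-type categories. The sketch below assumes the four-way split commonly used in TREC web track work (ROOT, SUBROOT, PATH, FILE); the function name `url_type`, the set of index file names, the "last segment contains a dot" file test, and the example URLs are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumption: a trailing index page is treated like its directory URL.
INDEX_NAMES = {"index.html", "index.htm"}


def url_type(url):
    """Classify a URL as ROOT, SUBROOT, PATH, or FILE (assumed taxonomy)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1].lower() in INDEX_NAMES:
        segments.pop()
        ends_in_file = False
    else:
        # Assumption: a last segment containing a dot is a file name.
        ends_in_file = bool(segments) and "." in segments[-1]
    if ends_in_file:
        return "FILE"
    if not segments:
        return "ROOT"      # site home page
    if len(segments) == 1:
        return "SUBROOT"   # one directory below the root
    return "PATH"          # deeper directory


# Hypothetical URLs.
examples = {
    u: url_type(u)
    for u in [
        "http://www.example.gov/",
        "http://www.example.gov/index.html",
        "http://www.example.gov/health/",
        "http://www.example.gov/health/topics/",
        "http://www.example.gov/health/report.pdf",
    ]
}
```

Under this split, "non-FILE" covers the ROOT, SUBROOT, and PATH categories.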

Doc-length

[Figure: document-length distribution, Training Set vs. .GOV Corpus; x-axis bins from <200 to >30000, y-axis 0.00% to 18.00%]

• Key resources don’t have too few words

In-site out-link analysis

• Definition

[Figure: Site A containing pages P1 and P2, with links labeled 1, 2, 3]

• Feature
  – In-site out-link number
  – In-site out-link anchor text rate

$$\text{rate} = \frac{\text{WordCount}(\text{in-site out-link anchor})}{\text{WordCount}(\text{web page full text})}$$
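
The two features above, the number of in-site out-links and the anchor text rate (words in in-site out-link anchors divided by words in the full page text), can be sketched with the standard library. This is an illustration under stated assumptions, not the authors' implementation: "same site" is taken to mean "same host name", words are whitespace-separated tokens, all text nodes (including any script or style text) count toward the full text, and the names `in_site_out_link_features` and `_LinkTextParser` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkTextParser(HTMLParser):
    """Collects total word count and per-link anchor word counts."""

    def __init__(self):
        super().__init__()
        self.text_words = 0   # words in all text nodes
        self.links = []       # [href, anchor_word_count] per <a href=...>
        self._open_link = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open_link = [href, 0]
                self.links.append(self._open_link)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_link = None

    def handle_data(self, data):
        n = len(data.split())
        self.text_words += n
        if self._open_link is not None:
            self._open_link[1] += n


def in_site_out_link_features(page_url, html):
    """Return (in-site out-link number, in-site out-link anchor text rate)."""
    parser = _LinkTextParser()
    parser.feed(html)
    parser.close()
    site = urlparse(page_url).netloc.lower()  # assumption: site == host name
    in_site = [
        (href, n) for href, n in parser.links
        if urlparse(urljoin(page_url, href)).netloc.lower() == site
    ]
    anchor_words = sum(n for _, n in in_site)
    rate = anchor_words / parser.text_words if parser.text_words else 0.0
    return len(in_site), rate


# Hypothetical page.
html = (
    "<html><body><p>Welcome to the agency</p>"
    '<a href="/grants/">Grant programs</a> '
    '<a href="http://other.example.org/">Partner site</a> '
    '<a href="news.html">Latest news releases</a></body></html>'
)
count, rate = in_site_out_link_features(
    "http://agency.example.gov/index.html", html
)
```

Relative links are resolved against the page URL before the host comparison, so `/grants/` and `news.html` count as in-site while the link to another host does not.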

In-site out-link analysis

• Key resource pages have more in-site out-links and longer in-site out-link anchor texts

[Figure panels: In-site out-link anchor text rate; In-site out-link anchor number]

Outline

• Selecting key resources is useful for TD

• Possibilities of selecting key resources

• How to select key resources?
  – Construction of a key resource decision tree

• Experiments

• Conclusion

Construction of a key resource decision tree

• Why decision tree?
  – The most effective and efficient classifier when there are a small number of features
    • 5 non-content features
  – Providing a metric to estimate the quality of these features in the form of
    • Information gain (ID3)
    • Information gain ratio (C4.5)
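
The two feature-quality metrics named above can be written down directly from their standard definitions: information gain (the ID3 split criterion) and gain ratio (the C4.5 split criterion). The sketch below works on discrete feature values; the toy feature and labels are made up for illustration and are not data from the talk.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy reduction from splitting `labels` by feature `values` (ID3)."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the split's own entropy (C4.5)."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0
    return information_gain(values, labels) / split_info


# Toy data (hypothetical): URL type as the feature, page class as the label.
values = ["non-file"] * 3 + ["file"] * 4 + ["non-file"]
labels = ["key"] * 3 + ["ordinary"] * 5

gain = information_gain(values, labels)
ratio = gain_ratio(values, labels)
```

Gain ratio penalizes features that split the data into many small groups, which plain information gain tends to favor.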

Construction of a key resource decision tree

[Figure: key resource decision tree, with one annotation reading "68.53% of .GOV"]

Outline

• Selecting key resources is useful for TD

• Possibilities of selecting key resources

• How to select key resources?

• Experiments
  – Is this key resource selection process effective?
  – Does TD perform better on the key resource result set?

• Conclusion

Is this key resource selection process effective?

• Key resource selection algorithm is effective

[Figure: annotated with 70% and 20%]

Does TD perform better on the key resource result set?

• Test set:
  – From TREC 2003 TD task
  – 50 topics and corresponding relevant qrels

• Evaluation metrics:
  – Precision at 10 documents
  – R-precision (precision at #relevant documents)

• Weighting
  – BM2500 ranking, default parameters
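
The two evaluation metrics listed above follow their standard definitions and can be sketched as below. The function names, the assumption that a ranking contains no duplicate document IDs, and the example ranking and judgments are illustrative only.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant.

    The denominator is k even if fewer than k documents were returned.
    """
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


# Hypothetical ranking and relevance judgments for one topic.
ranked = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d5", "d42"}

p10 = precision_at_k(ranked, relevant, k=10)
rprec = r_precision(ranked, relevant)
```

Per-topic values like these would then be averaged over the 50 topics.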

Does TD perform better on the key resource result set?

• Text retrieval on different data sets

[Figure legend: G = .GOV corpus, K = Key resource set, F = Full text, A = Anchor text, T = TREC 2003 best run]

[Figure annotations: 76%, 83%, 24.89% .GOV data]

Conclusion

• Key resource pre-selection is needed for TD
  – Finding high-quality pages independent of a given user request

• A new type of non-content feature
  – In-site out-link analysis

• An algorithm that uses a decision tree to find key resources

• Key resource page set:
  – uses less than 20% of .GOV pages
  – covers more than 70% of key resource information
  – gets better performance than the whole page set (a 76% performance improvement in P@10)

Welcome to contact me:

[email protected]

Thank you!

Questions and comments?