
A Logistic Regression Approach to Distributed IR
Ray R. Larson : School of Information Management & Systems, University of California, Berkeley -- [email protected]

TREC Disk | Source      | Size (MB) | Size (docs)
1         | WSJ (86-89) | 270       | 98,732
1         | AP (89)     | 259       | 84,678
1         | ZIFF        | 245       | 75,180
1         | FR (89)     | 262       | 25,960
2         | WSJ (90-92) | 247       | 74,520
2         | AP (88)     | 241       | 79,919
2         | ZIFF        | 178       | 56,920
2         | FR (88)     | 211       | 19,860
3         | AP (90)     | 242       | 78,321
3         | SJMN (91)   | 290       | 90,257
3         | PAT         | 245       | 6,711
Totals    |             | 2,690     | 691,058

TREC Disk | Source      | Num DB     | Total DB per disk
1         | WSJ (86-89) | 29         | Disk 1: 67
1         | AP (89)     | 12         |
1         | ZIFF        | 14         |
1         | FR (89)     | 12         |
2         | WSJ (90-92) | 22         | Disk 2: 54
2         | AP (88)     | 11         |
2         | ZIFF        | 11 (1 dup) |
2         | FR (88)     | 10         |
3         | AP (90)     | 12         | Disk 3: 116
3         | SJMN (91)   | 12         |
3         | PAT         | 92         |
Totals    |             | 237 - 1 (duplicate) = 236 |

P(R \mid Q, C) = \frac{e^{\,c_0 + \sum_{i=1}^{6} c_i X_i}}{1 + e^{\,c_0 + \sum_{i=1}^{6} c_i X_i}}

We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time. We calculate the probability of relevance using logistic regression over a sample set of documents to determine the values of the coefficients. At retrieval time the probability of relevance for a particular query Q and a collection C is estimated by the formula shown above.

Probabilistic Retrieval Using Logistic Regression

For the 6 X attribute measures shown below:

X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}   (Average Absolute Query Frequency)

X_2 = QL   (Query Length)

X_3 = \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j}   (Average Absolute Collection Frequency)

X_4 = CL   (Collection Length)

X_5 = \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j}   (Average Inverse Collection Frequency)

X_6 = \log M

ICF_t = N / n_t   (Inverse Collection Frequency)

where N = number of collections, n_t = number of collections containing term t, and M = number of terms in common between the query and the collection.

\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i

For very long (full TREC query) queries:
\log O(R \mid Q, C) = 20.9850 + 9.6801\,X_1 + 1.8669\,X_2 + 1.1921\,X_3 + 0.0053\,X_4 + 6.2501\,X_5 + 7.5491\,X_6

For long (title and description) queries:
\log O(R \mid Q, C) = 7.0103 + 2.3188\,X_1 + 1.8669\,X_2 + 1.0695\,X_3 + 0.0029\,X_4 + 5.9174\,X_5 + 2.3612\,X_6

For short (title only) queries:
\log O(R \mid Q, C) = 3.70 + 1.269\,X_1 + 0.310\,X_2 + 0.679\,X_3 + 0.223\,X_5 + 2.01\,X_6
where K is a constant (316.2277).

The probabilities are actually calculated as log odds and then converted to probability estimates for ranking.

The c_i coefficients were estimated separately for three query types; during retrieval, the length of the query was used to decide which set of coefficients to apply. A sketch of the retrieval-time scoring computation follows below.
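To make the retrieval-time computation concrete, here is a minimal Python sketch of how a collection score of this form might be computed. The data structures and names are illustrative assumptions, not taken from the poster, and the coefficients would be the ones fitted by the regression:

```python
import math

def collection_score(query_terms, coll_tf, coll_len, coll_df, n_collections, coeffs):
    """Sketch of scoring one collection for one query with the regression above.

    query_terms   -- list of query tokens
    coll_tf       -- dict: term -> frequency of the term in the collection (CAF)
    coll_len      -- total number of term occurrences in the collection (CL)
    coll_df       -- dict: term -> number of collections containing the term (n_t)
    n_collections -- total number of collections (N)
    coeffs        -- (c0, c1, ..., c6) estimated by logistic regression
    """
    qtf = {}
    for t in query_terms:                        # absolute query frequencies (QAF)
        qtf[t] = qtf.get(t, 0) + 1

    matching = [t for t in qtf if coll_tf.get(t, 0) > 0]
    m = len(matching)                            # terms shared by query and collection
    if m == 0:
        return 0.0

    x1 = sum(math.log(qtf[t]) for t in matching) / m                       # avg log QAF
    x2 = len(query_terms)                                                  # query length
    x3 = sum(math.log(coll_tf[t]) for t in matching) / m                   # avg log CAF
    x4 = coll_len                                                          # collection length
    x5 = sum(math.log(n_collections / coll_df.get(t, 1)) for t in matching) / m  # avg log ICF
    x6 = math.log(m)

    c0, c1, c2, c3, c4, c5, c6 = coeffs
    log_odds = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + c6*x6
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))  # log odds -> probability
```

Collections would then be ranked by this estimated probability of relevance for the query.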

Test Database Characteristics

The Problem

We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (using the same contents as in evaluations by Powell & French and Callan). The query set used was TREC queries 51-150. Collection relevance information was based on whether any documents in the collection were relevant according to the relevance judgements for TREC queries 51-150. The relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (with full data).
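As an illustration of that last step, the collection-level relevance labels can be derived from the document-level judgements roughly as follows; this is a hedged sketch, and the qrels and doc_to_collection inputs are assumed, hypothetical structures:

```python
def collection_relevance(qrels, doc_to_collection):
    """A collection counts as relevant for a query if it holds at least one
    relevant document according to the TREC judgements.

    qrels             -- dict: query_id -> set of relevant document ids
    doc_to_collection -- dict: document id -> collection id (source + month)
    """
    relevant_collections = {}
    for qid, rel_docs in qrels.items():
        relevant_collections[qid] = {
            doc_to_collection[d] for d in rel_docs if d in doc_to_collection
        }
    return relevant_collections
```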

• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the "best" ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of databases (metadata only, full text…)

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

Distributed IR Tasks

This research was sponsored at U.C. Berkeley and the University of Liverpool by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program award #IIS-99755164. James French and Allison Powell kindly provided the CORI and Ideal(0) results used in the evaluation.

Acknowledgements

[Diagram: the MetaSearch server holds a distributed index and its own search engine; Explain and Scan queries are mapped over the Internet to the individual database servers (DB 1 through DB 6), each with its own search engine, and queries and results are mapped back and forth between the MetaSearch server and those servers.]

Our Approach Using Z39.50

[Diagram: server hierarchy: replicated servers, meta-topical servers, general servers, and database servers.]

– Tested using the collection representatives as harvested over the network and the TIPSTER relevance judgements
– Testing by comparing our approach to known algorithms for ranking collections
– Results (preliminary) were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX)
– Recall analog: how many of the relevant documents occurred in the top n databases, averaged over the queries

[Figure: three panels (Title Queries, Long Queries, Very Long Queries) plotting the recall analog R̂ (y axis, 0 to 1) against the number of collections searched (x axis, 1 to 233) for the Prob, Max, CORI, and Ideal() rankings.]

Distributed Retrieval Testing and Results

CORI Ranking

p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I

where

T = \frac{df}{df + 50 + 150 \cdot \frac{cw}{\overline{cw}}}

I = \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}

df = number of documents in db_i containing r_k
cf = number of databases containing r_k
|DB| = number of databases being ranked
cw = number of words in db_i
\overline{cw} = average number of words in the databases being ranked
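A small Python sketch of this scoring rule, assuming the per-term statistics are already available; the function and parameter names here are illustrative, not Callan's:

```python
import math

def cori_belief(df, cf, cw, avg_cw, n_db):
    """Sketch of the CORI belief p(r_k | db_i) for a single query term r_k.

    df     -- number of documents in db_i containing r_k
    cf     -- number of databases containing r_k
    cw     -- number of words in db_i
    avg_cw -- average number of words over the databases being ranked
    n_db   -- number of databases being ranked (|DB|)
    """
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)
    i = math.log((n_db + 0.5) / cf) / math.log(n_db + 1.0)
    return 0.4 + 0.6 * t * i
```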

Assume each database has some merit for a given query, q

Given a Baseline ranking B and an estimated (test) ranking E for the evaluated system

Let db_bi and db_ei denote the databases in the i-th ranked position of rankings B and E.

Let B_i = merit(q, db_bi) and E_i = merit(q, db_ei).

We can define some measures:

Recall analogs:

R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}
\qquad
\hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^{*}} B_i},
\quad \text{where } n^{*} = \max\{k : B_k > 0\}

And a precision analog:

P_n = \frac{\left|\{\, db \in Top_n(E) : merit(q, db) > 0 \,\}\right|}{\left|Top_n(E)\right|}

i.e., the fraction of the top n databases in the estimated ranking with non-zero merit.
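These measures are straightforward to compute; the sketch below assumes E and B are lists of merit values ordered by the estimated and baseline rankings respectively (the names are illustrative, not from the poster):

```python
def recall_analog(E, B, n):
    """R_n: merit accumulated by the estimated ranking E relative to the baseline B
    over the top n positions (E[i] and B[i] hold merit(q, db) at rank i+1)."""
    return sum(E[:n]) / sum(B[:n])

def recall_analog_hat(E, B, n):
    """R-hat_n: same numerator, normalised by the total merit of the baseline
    databases with non-zero merit (n* = max k such that B_k > 0)."""
    n_star = max((k + 1 for k, b in enumerate(B) if b > 0), default=0)
    return sum(E[:n]) / sum(B[:n_star])

def precision_analog(E, n):
    """P_n: fraction of the top n databases in the estimated ranking with non-zero merit."""
    top = E[:n]
    return sum(1 for merit in top if merit > 0) / len(top)
```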

Effectiveness Measures

Comparative Evaluation

MetaSearch: a new approach to building metasearch based on Z39.50. Instead of using broadcast search we are using two Z39.50 services:
 -- Identification of database metadata using Z39.50 Explain
 -- Extraction of distributed indexes using Z39.50 SCAN
 -- Creation of "Collection Documents" using index contents
Evaluation questions:
 -- How efficiently can we build distributed indexes?
 -- How effectively can we choose databases using the index?
 -- How effective is merging search results from multiple sources?
 -- Do hierarchies of servers (general/meta-topical/individual) work?

• For all servers (could be a topical subset)…
  – Get Explain information to find which indexes are supported and the collection statistics
  – For each index:
    • Use SCAN to extract terms and frequency information
    • Add term + frequency + source index + database metadata to the metasearch XML "Collection Document"
  – Index the collection documents for retrieval by the algorithm above (a sketch of this harvesting loop is given below)
• Planned extensions
  – Post-process indexes (especially geographic names, etc.) for special types of data, e.g. create "geographical coverage" indexes
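The sketch below illustrates the harvesting loop in Python; z3950_explain and z3950_scan are hypothetical placeholders standing in for whatever Z39.50 client library handles the Explain and SCAN services, not a real API:

```python
import xml.etree.ElementTree as ET

def build_collection_document(server, z3950_explain, z3950_scan):
    """Build an XML "collection document" for one server, following the steps above.

    server        -- address/identifier of the Z39.50 server
    z3950_explain -- hypothetical helper returning (metadata dict, list of index names)
    z3950_scan    -- hypothetical helper: (server, index) -> [(term, frequency), ...]
    """
    metadata, indexes = z3950_explain(server)            # supported indexes + collection stats
    root = ET.Element("collectionDocument",
                      attrib={k: str(v) for k, v in metadata.items()})
    for index in indexes:
        idx_el = ET.SubElement(root, "index", name=index)
        for term, freq in z3950_scan(server, index):     # terms + frequencies from SCAN
            ET.SubElement(idx_el, "term", freq=str(freq)).text = term
    return ET.tostring(root, encoding="unicode")         # ready to be indexed for retrieval
```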

Application and Further Research

Data Harvesting and Collection Document Creation

The figures summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking and the Y axis, R̂, is a recall analog that measures the proportion of the total possible relevant documents accumulated in the top N databases, averaged across all of the queries. The Max line is the optimal result, where the collections are ranked in order of the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm, and CORI is an implementation of Callan's inference net approach. The Prob line is the logistic regression method described above. For title queries the described method performs slightly better than the CORI algorithm for up to about 100 collections, where CORI exceeds it. For long queries our method is virtually identical to CORI, and CORI performs better for very long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.

Results and Discussion

The method described here is being applied to two distributed systems of servers in the UK. The first (the Distributed Archives Hub) will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. MerseyLibraries.org is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index to provide efficient distributed search over the various servers. The basic model is shown below: individual database servers will be harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials. We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection document statistics may provide a better ranking of results and thus more effective routing of queries.