
A Logistic Regression Approach to Distributed IR

Ray R. Larson : School of Information Management & Systems, University of California, Berkeley -- ray@sherlock.berkeley.edu

TREC Disk | Source      | Size (MB) | Size (docs)
1         | WSJ (86-89) | 270       | 98,732
1         | AP (89)     | 259       | 84,678
1         | ZIFF        | 245       | 75,180
1         | FR (89)     | 262       | 25,960
2         | WSJ (90-92) | 247       | 74,520
2         | AP (88)     | 241       | 79,919
2         | ZIFF        | 178       | 56,920
2         | FR (88)     | 211       | 19,860
3         | AP (90)     | 242       | 78,321
3         | SJMN (91)   | 290       | 90,257
3         | PAT         | 245       | 6,711
Totals    |             | 2,690     | 691,058

TREC Disk | Source      | Num DB     | Total DB per disk
1         | WSJ (86-89) | 29         | Disk 1: 67
1         | AP (89)     | 12         |
1         | ZIFF        | 14         |
1         | FR (89)     | 12         |
2         | WSJ (90-92) | 22         | Disk 2: 54
2         | AP (88)     | 11         |
2         | ZIFF        | 11 (1 dup) |
2         | FR (88)     | 10         |
3         | AP (90)     | 12         | Disk 3: 116
3         | SJMN (91)   | 12         |
3         | PAT         | 92         |
Totals    |             | 237 - 1 duplicate = 236 |

Probabilistic Retrieval Using Logistic Regression

We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. We calculate the probability of relevance using logistic regression over a sample set of documents to determine the values of the coefficients. At retrieval time the probability of relevance for a particular query Q and a collection C is estimated by:

$$P(R \mid Q, C) = \frac{1}{1 + e^{-\left(c_0 + \sum_{i=1}^{6} c_i X_i\right)}}$$

For the 6 X attribute measures shown below:

$$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j} \quad \text{(Average Absolute Query Frequency)}$$

$$X_2 = \sqrt{QL} \quad \text{(Query Length)}$$

$$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log CAF_{t_j} \quad \text{(Average Absolute Collection Frequency)}$$

$$X_4 = \sqrt{CL} \quad \text{(Collection Length)}$$

$$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log ICF_{t_j} \quad \text{(Average Inverse Collection Frequency)}$$

$$X_6 = \log M$$

where $ICF_t = N / n_t$ is the Inverse Collection Frequency (N = number of collections, $n_t$ = number of collections containing term t), and M = number of terms in common between the query and the collection.
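To make the six measures concrete, here is a minimal sketch of computing them for one query/collection pair. The function name and the dict-based term statistics are our own illustration of the definitions above, not code from the poster:

```python
import math

def collection_features(query_tf, coll_tf, coll_df, coll_len, query_len, n_collections):
    """Compute X1..X6 for one query/collection pair.

    query_tf      -- dict: term -> frequency in the query (QAF, >= 1)
    coll_tf       -- dict: term -> frequency in the collection (CAF, >= 1)
    coll_df       -- dict: term -> number of collections containing the term (n_t)
    coll_len      -- total words in the collection (CL)
    query_len     -- words in the query (QL)
    n_collections -- number of collections (N)
    """
    matching = [t for t in query_tf if t in coll_tf]  # terms in common
    M = len(matching)
    if M == 0:
        return None  # no shared terms, so no evidence to score
    x1 = sum(math.log(query_tf[t]) for t in matching) / M             # avg log QAF
    x2 = math.sqrt(query_len)                                         # query length
    x3 = sum(math.log(coll_tf[t]) for t in matching) / M              # avg log CAF
    x4 = math.sqrt(coll_len)                                          # collection length
    x5 = sum(math.log(n_collections / coll_df[t]) for t in matching) / M  # avg log ICF
    x6 = math.log(M)                                                  # matching terms
    return [x1, x2, x3, x4, x5, x6]
```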

The probabilities are actually calculated as log odds and then converted to probabilities:

$$\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i$$

The $c_i$ coefficients were estimated separately for three query types (during retrieval, the length of the query was used to differentiate these).

For short (title only) queries:

$$\log O(R \mid Q, C) = -3.70 + 1.269\,X_1 + 0.310\,X_2 + 0.679\,X_3 + K\,X_4 + 0.223\,X_5 + 2.01\,X_6$$

where K is a constant (2277.316).

For long (title and description) queries:

$$\log O(R \mid Q, C) = -7.0103 + 2.3188\,X_1 + 1.8669\,X_2 + 1.0695\,X_3 + 0.0029\,X_4 + 5.9174\,X_5 + 2.3612\,X_6$$

For very long (full TREC query) queries:

$$\log O(R \mid Q, C) = -20.9850 + 9.6801\,X_1 + 1.8669\,X_2 + 1.1921\,X_3 + 0.0053\,X_4 + 6.2501\,X_5 + 7.5491\,X_6$$
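Scoring a collection then reduces to an inverse-logit of the fitted log odds. A minimal sketch (function name ours), assuming the features X1..X6 are computed as above and the coefficients for the appropriate query type are supplied:

```python
import math

def relevance_probability(x, c):
    """P(R | Q, C) from the fitted log-odds model.

    x -- the six feature values [X1, ..., X6]
    c -- the seven coefficients [c0, c1, ..., c6] for the query type
    """
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse logit: log odds -> probability

# Collections are ranked by descending probability; since the logistic
# transform is monotone, ranking directly by log odds gives the same order.
```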

Test Database Characteristics

We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (using the same contents as in evaluations by Powell & French and Callan). The query set used was TREC queries 51-150. Collection relevance information was based on whether any documents in a collection were relevant according to the relevance judgements for TREC queries 51-150. The relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (with the full data).

The Problem

• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the "best" ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of databases (metadata only, full text…)

Distributed IR Tasks

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

Acknowledgements

This research was sponsored at U.C. Berkeley and the University of Liverpool by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program, award #IIS-99755164. James French and Allison Powell kindly provided the CORI and Ideal(0) results used in the evaluation.

Our Approach Using Z39.50

[Figure: The MetaSearch server sends Explain and SCAN queries across the Internet to the search engines hosting DB 1-DB 6, maps the queries and results, and accumulates the harvested terms in a Distributed Index used to route searches.]

[Figure: A hierarchy of replicated servers, meta-topical servers, general servers, and database servers.]

Distributed Retrieval Testing and Results

Comparative evaluation:
– Tested using the collection representatives as harvested over the network and the TIPSTER relevance judgements
– Tested by comparing our approach to known algorithms for ranking collections
– Results (preliminary) were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (Max)
– Recall analog: how many of the relevant documents occurred in the top n databases, averaged over the queries

[Figure: "Very Long Queries": recall analog (0-1) vs. number of collections (1-236) for the Prob, Max, CORI, and Ideal() rankings.]

[Figure: "Title Queries": recall analog (0-1) vs. number of collections for the same four rankings.]

[Figure: "Long Queries": recall analog (0-1) vs. number of collections for the same four rankings.]


CORI Ranking

$$T = \frac{df}{df + 50 + 150 \cdot cw / \overline{cw}}$$

$$I = \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}$$

$$p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I$$

where:
df is the number of documents containing $r_k$
cf is the number of databases containing $r_k$
|DB| is the number of databases being ranked
cw is the number of words in $db_i$
$\overline{cw}$ is the average cw of the databases being ranked
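For reference, the formula above translates directly into code. This is our sketch, not the evaluation's actual implementation; combining the per-term beliefs by averaging is one common choice and is an assumption here:

```python
import math

def cori_belief(df, cf, cw, avg_cw, n_db):
    """Belief p(r_k | db_i) for a single query term, per the formula above.

    df     -- number of documents in db_i containing the term
    cf     -- number of databases containing the term
    cw     -- number of words in db_i
    avg_cw -- average cw over the databases being ranked
    n_db   -- number of databases being ranked (|DB|)
    """
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)
    i = math.log((n_db + 0.5) / cf) / math.log(n_db + 1.0)
    return 0.4 + 0.6 * t * i

def cori_score(term_stats, cw, avg_cw, n_db):
    """Score a database for a query as the mean belief over its query terms.

    term_stats -- list of (df, cf) pairs, one per query term
    """
    beliefs = [cori_belief(df, cf, cw, avg_cw, n_db) for df, cf in term_stats]
    return sum(beliefs) / len(beliefs)
```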

Effectiveness Measures

Assume each database has some merit, merit(q, db), for a given query q. Given a baseline ranking B and an estimated (test) ranking E for the evaluated system, let $db_{b_i}$ and $db_{e_i}$ denote the database in the i-th ranked position of rankings B and E, and let $B_i = merit(q, db_{b_i})$ and $E_i = merit(q, db_{e_i})$. We can define two recall analogs:

$$R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i} \qquad\qquad \hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^*} B_i}, \quad n^* = \max\{k \mid B_k \neq 0\}$$

and a precision analog:

$$P_n = \frac{\left|\{db \in Top_n(E) \mid merit(q, db) > 0\}\right|}{\left|Top_n(E)\right|}$$

i.e., the fraction of the top n databases in the estimated ranking with non-zero merit.
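In code these measures are only a few lines. A sketch (function names ours), assuming the merit values are already listed in each ranking's order:

```python
def recall_analog(E, B, n, against_total=False):
    """R_n, or R-hat_n when against_total is True.

    E -- merit values in the estimated ranking's order (E_1, E_2, ...)
    B -- merit values in the baseline ranking's order (B_1, B_2, ...)
    """
    denom_k = n
    if against_total:
        # n* = max k such that B_k != 0; with B sorted by descending merit
        # this is simply the number of databases with any merit at all
        nonzero = [k + 1 for k, b in enumerate(B) if b != 0]
        denom_k = max(nonzero) if nonzero else 0
    denom = sum(B[:denom_k])
    return sum(E[:n]) / denom if denom else 0.0

def precision_analog(E, n):
    """P_n: fraction of the top n estimated databases with non-zero merit."""
    top = E[:n]
    return sum(1 for e in top if e > 0) / len(top)
```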


MetaSearch

New approach to building metasearch based on Z39.50. Instead of using broadcast search we are using two Z39.50 services:
-- Identification of database metadata using Z39.50 Explain
-- Extraction of distributed indexes using Z39.50 SCAN
-- Creation of "Collection Documents" using index contents

Evaluation questions:
-- How efficiently can we build distributed indexes?
-- How effectively can we choose databases using the index?
-- How effective is merging search results from multiple sources?
-- Do hierarchies of servers (general/meta-topical/individual) work?

Data Harvesting and Collection Document Creation

• For all servers (could be a topical subset)…
  – Get Explain information to find which indexes are supported and the collection statistics.
  – For each index:
    • Use SCAN to extract terms and frequency information
    • Add term + frequency + source index + database metadata to the metasearch XML "Collection Document" (a sketch of this step follows the list)
  – Index the collection documents for retrieval by the algorithm described above
• Planned extensions:
  – Post-process indexes (especially geographic names, etc.) for special types of data, e.g. create "geographical coverage" indexes
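As a sketch of the collection-document step only (the Z39.50 Explain/SCAN plumbing is omitted; assume the (term, frequency) pairs have already been fetched), one might serialize the harvested statistics like this. The element names are illustrative, not the poster's actual schema:

```python
import xml.etree.ElementTree as ET

def build_collection_document(db_name, index_name, scan_terms):
    """Serialize harvested index terms into an XML 'collection document'.

    scan_terms -- iterable of (term, frequency) pairs, as extracted by a
                  Z39.50 SCAN of the named index (harvesting not shown here)
    """
    root = ET.Element("collectionDoc", {"database": db_name})
    idx = ET.SubElement(root, "index", {"name": index_name})
    for term, freq in scan_terms:
        el = ET.SubElement(idx, "term", {"freq": str(freq)})
        el.text = term
    return ET.tostring(root, encoding="unicode")

# Example:
# build_collection_document("DB 1", "title", [("logistic", 42), ("regression", 17)])
```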

Results and Discussion

The figures above summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking, and the Y axis, $\hat{R}_n$, is a recall analog that measures the proportion of the total possible relevant documents that have been accumulated in the top n databases, averaged across all of the queries. The Max line is the optimal result, where the collections are ranked in order of the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm, and CORI is an implementation of Callan's inference net approach. The Prob line is the logistic regression method described above. For title queries the described method performs slightly better than the CORI algorithm for up to about 100 collections, where CORI exceeds it. For long queries our method is virtually identical to CORI, and CORI performs better for very long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.

Application and Further Research

The method described here is being applied to two distributed systems of servers in the UK. The first, the Distributed Archives Hub, will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. The second, MerseyLibraries.org, is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index to provide efficient distributed search over the various servers. The basic model is shown in the server hierarchy figure above: individual database servers are harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials. We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection document statistics may provide a better ranking of results and thus more effective routing of queries.