
A Logistic Regression Approach to Distributed IR

Ray R. Larson : School of Information Management & Systems, University of California, Berkeley -- ray@sherlock.berkeley.edu

TREC Disk | Source      | Size (MB) | Size (docs)
1         | WSJ (86-89) | 270       | 98,732
1         | AP (89)     | 259       | 84,678
1         | ZIFF        | 245       | 75,180
1         | FR (89)     | 262       | 25,960
2         | WSJ (90-92) | 247       | 74,520
2         | AP (88)     | 241       | 79,919
2         | ZIFF        | 178       | 56,920
2         | FR (88)     | 211       | 19,860
3         | AP (90)     | 242       | 78,321
3         | SJMN (91)   | 290       | 90,257
3         | PAT         | 245       | 6,711
Totals    |             | 2,690     | 691,058

TREC Disk | Source      | Num DB     | Total DB per disk
1         | WSJ (86-89) | 29         | Disk 1: 67
1         | AP (89)     | 12         |
1         | ZIFF        | 14         |
1         | FR (89)     | 12         |
2         | WSJ (90-92) | 22         | Disk 2: 54
2         | AP (88)     | 11         |
2         | ZIFF        | 11 (1 dup) |
2         | FR (88)     | 10         |
3         | AP (90)     | 12         | Disk 3: 116
3         | SJMN (91)   | 12         |
3         | PAT         | 92         |
Totals    |             | 237 - 1 duplicate = 236 |

Probabilistic Retrieval Using Logistic Regression

We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. We calculate the probability of relevance using logistic regression over a sample set of documents to determine the values of the coefficients. At retrieval time the probability of relevance for a particular query Q and a collection C is estimated by:

$$P(R \mid Q, C) = \frac{1}{1 + e^{-\left(c_0 + \sum_{i=1}^{6} c_i X_i\right)}}$$

For the 6 X attribute measures shown below:

$$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j} \quad \text{(Average Absolute Query Frequency)}$$

$$X_2 = \sqrt{QL} \quad \text{(Query Length)}$$

$$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log CAF_{t_j} \quad \text{(Average Absolute Collection Frequency)}$$

$$X_4 = \sqrt{CL} \quad \text{(Collection Length)}$$

$$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log ICF_{t_j} \quad \text{(Average Inverse Collection Frequency)}$$

$$X_6 = \log M$$

where $ICF_t = N / n_t$ is the Inverse Collection Frequency (N = number of collections, $n_t$ = number of collections containing term t), and M = number of terms in common between the query and the collection.
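To make the six measures concrete, here is a minimal sketch of computing them for one query/collection pair. The function name and the dict-based term statistics are our own illustration of the definitions above, not code from the poster:

```python
import math

def collection_features(query_tf, coll_tf, coll_df, coll_len, query_len, n_collections):
    """Compute X1..X6 for one query/collection pair.

    query_tf      -- dict: term -> frequency in the query (QAF, >= 1)
    coll_tf       -- dict: term -> frequency in the collection (CAF, >= 1)
    coll_df       -- dict: term -> number of collections containing the term (n_t)
    coll_len      -- total words in the collection (CL)
    query_len     -- words in the query (QL)
    n_collections -- number of collections (N)
    """
    matching = [t for t in query_tf if t in coll_tf]  # terms in common
    M = len(matching)
    if M == 0:
        return None  # no shared terms, so no evidence to score
    x1 = sum(math.log(query_tf[t]) for t in matching) / M             # avg log QAF
    x2 = math.sqrt(query_len)                                         # query length
    x3 = sum(math.log(coll_tf[t]) for t in matching) / M              # avg log CAF
    x4 = math.sqrt(coll_len)                                          # collection length
    x5 = sum(math.log(n_collections / coll_df[t]) for t in matching) / M  # avg log ICF
    x6 = math.log(M)                                                  # matching terms
    return [x1, x2, x3, x4, x5, x6]
```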

The probabilities are actually calculated as log odds and then converted to probabilities:

$$\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i$$

The $c_i$ coefficients were estimated separately for three query types (during retrieval, the length of the query was used to differentiate these).

For short (title only) queries:

$$\log O(R \mid Q, C) = -3.70 + 1.269\,X_1 + 0.310\,X_2 + 0.679\,X_3 + K\,X_4 + 0.223\,X_5 + 2.01\,X_6$$

where K is a constant (2277.316).

For long (title and description) queries:

$$\log O(R \mid Q, C) = -7.0103 + 2.3188\,X_1 + 1.8669\,X_2 + 1.0695\,X_3 + 0.0029\,X_4 + 5.9174\,X_5 + 2.3612\,X_6$$

For very long (full TREC query) queries:

$$\log O(R \mid Q, C) = -20.9850 + 9.6801\,X_1 + 1.8669\,X_2 + 1.1921\,X_3 + 0.0053\,X_4 + 6.2501\,X_5 + 7.5491\,X_6$$
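Scoring a collection then reduces to an inverse-logit of the fitted log odds. A minimal sketch (function name ours), assuming the features X1..X6 are computed as above and the coefficients for the appropriate query type are supplied:

```python
import math

def relevance_probability(x, c):
    """P(R | Q, C) from the fitted log-odds model.

    x -- the six feature values [X1, ..., X6]
    c -- the seven coefficients [c0, c1, ..., c6] for the query type
    """
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse logit: log odds -> probability

# Collections are ranked by descending probability; since the logistic
# transform is monotone, ranking directly by log odds gives the same order.
```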

Test Database Characteristics

We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (using the same contents as in evaluations by Powell & French and Callan). The query set used was TREC queries 51-150. Collection relevance information was based on whether any documents in a collection were relevant according to the relevance judgements for TREC queries 51-150. The relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (with the full data).

The Problem

• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the "best" ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of databases (metadata only, full text…)

Distributed IR Tasks

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

Acknowledgements

This research was sponsored at U.C. Berkeley and the University of Liverpool by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program, award #IIS-99755164. James French and Allison Powell kindly provided the CORI and Ideal(0) results used in the evaluation.

Our Approach Using Z39.50

[Figure: The MetaSearch server sends Explain and SCAN queries across the Internet to the search engines hosting DB 1-DB 6, maps the queries and results, and accumulates the harvested terms in a Distributed Index used to route searches.]

[Figure: A hierarchy of replicated servers, meta-topical servers, general servers, and database servers.]

Distributed Retrieval Testing and Results

Comparative evaluation:
– Tested using the collection representatives as harvested over the network and the TIPSTER relevance judgements
– Tested by comparing our approach to known algorithms for ranking collections
– Results (preliminary) were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (Max)
– Recall analog: how many of the relevant documents occurred in the top n databases, averaged over the queries

[Figure: "Very Long Queries": recall analog (0-1) vs. number of collections (1-236) for the Prob, Max, CORI, and Ideal() rankings.]

[Figure: "Title Queries": recall analog (0-1) vs. number of collections for the same four rankings.]

[Figure: "Long Queries": recall analog (0-1) vs. number of collections for the same four rankings.]


CORI Ranking

$$T = \frac{df}{df + 50 + 150 \cdot cw / \overline{cw}}$$

$$I = \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}$$

$$p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I$$

where:
df is the number of documents containing $r_k$
cf is the number of databases containing $r_k$
|DB| is the number of databases being ranked
cw is the number of words in $db_i$
$\overline{cw}$ is the average cw of the databases being ranked
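For reference, the formula above translates directly into code. This is our sketch, not the evaluation's actual implementation; combining the per-term beliefs by averaging is one common choice and is an assumption here:

```python
import math

def cori_belief(df, cf, cw, avg_cw, n_db):
    """Belief p(r_k | db_i) for a single query term, per the formula above.

    df     -- number of documents in db_i containing the term
    cf     -- number of databases containing the term
    cw     -- number of words in db_i
    avg_cw -- average cw over the databases being ranked
    n_db   -- number of databases being ranked (|DB|)
    """
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)
    i = math.log((n_db + 0.5) / cf) / math.log(n_db + 1.0)
    return 0.4 + 0.6 * t * i

def cori_score(term_stats, cw, avg_cw, n_db):
    """Score a database for a query as the mean belief over its query terms.

    term_stats -- list of (df, cf) pairs, one per query term
    """
    beliefs = [cori_belief(df, cf, cw, avg_cw, n_db) for df, cf in term_stats]
    return sum(beliefs) / len(beliefs)
```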

Effectiveness Measures

Assume each database has some merit, merit(q, db), for a given query q. Given a baseline ranking B and an estimated (test) ranking E for the evaluated system, let $db_{b_i}$ and $db_{e_i}$ denote the database in the i-th ranked position of rankings B and E, and let $B_i = merit(q, db_{b_i})$ and $E_i = merit(q, db_{e_i})$. We can define two recall analogs:

$$R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i} \qquad\qquad \hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^*} B_i}, \quad n^* = \max\{k \mid B_k \neq 0\}$$

and a precision analog:

$$P_n = \frac{\left|\{db \in Top_n(E) \mid merit(q, db) > 0\}\right|}{\left|Top_n(E)\right|}$$

i.e., the fraction of the top n databases in the estimated ranking with non-zero merit.
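In code these measures are only a few lines. A sketch (function names ours), assuming the merit values are already listed in each ranking's order:

```python
def recall_analog(E, B, n, against_total=False):
    """R_n, or R-hat_n when against_total is True.

    E -- merit values in the estimated ranking's order (E_1, E_2, ...)
    B -- merit values in the baseline ranking's order (B_1, B_2, ...)
    """
    denom_k = n
    if against_total:
        # n* = max k such that B_k != 0; with B sorted by descending merit
        # this is simply the number of databases with any merit at all
        nonzero = [k + 1 for k, b in enumerate(B) if b != 0]
        denom_k = max(nonzero) if nonzero else 0
    denom = sum(B[:denom_k])
    return sum(E[:n]) / denom if denom else 0.0

def precision_analog(E, n):
    """P_n: fraction of the top n estimated databases with non-zero merit."""
    top = E[:n]
    return sum(1 for e in top if e > 0) / len(top)
```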


MetaSearch

New approach to building metasearch based on Z39.50. Instead of using broadcast search we are using two Z39.50 services:
-- Identification of database metadata using Z39.50 Explain
-- Extraction of distributed indexes using Z39.50 SCAN
-- Creation of "Collection Documents" using index contents

Evaluation questions:
-- How efficiently can we build distributed indexes?
-- How effectively can we choose databases using the index?
-- How effective is merging search results from multiple sources?
-- Do hierarchies of servers (general/meta-topical/individual) work?

Data Harvesting and Collection Document Creation

• For all servers (could be a topical subset)…
  – Get Explain information to find which indexes are supported and the collection statistics.
  – For each index:
    • Use SCAN to extract terms and frequency information
    • Add term + frequency + source index + database metadata to the metasearch XML "Collection Document" (a sketch of this step follows the list)
  – Index the collection documents for retrieval by the algorithm described above
• Planned extensions:
  – Post-process indexes (especially geographic names, etc.) for special types of data, e.g. create "geographical coverage" indexes
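As a sketch of the collection-document step only (the Z39.50 Explain/SCAN plumbing is omitted; assume the (term, frequency) pairs have already been fetched), one might serialize the harvested statistics like this. The element names are illustrative, not the poster's actual schema:

```python
import xml.etree.ElementTree as ET

def build_collection_document(db_name, index_name, scan_terms):
    """Serialize harvested index terms into an XML 'collection document'.

    scan_terms -- iterable of (term, frequency) pairs, as extracted by a
                  Z39.50 SCAN of the named index (harvesting not shown here)
    """
    root = ET.Element("collectionDoc", {"database": db_name})
    idx = ET.SubElement(root, "index", {"name": index_name})
    for term, freq in scan_terms:
        el = ET.SubElement(idx, "term", {"freq": str(freq)})
        el.text = term
    return ET.tostring(root, encoding="unicode")

# Example:
# build_collection_document("DB 1", "title", [("logistic", 42), ("regression", 17)])
```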

Results and Discussion

The figures above summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking, and the Y axis, $\hat{R}_n$, is a recall analog that measures the proportion of the total possible relevant documents that have been accumulated in the top n databases, averaged across all of the queries. The Max line is the optimal result, where the collections are ranked in order of the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm, and CORI is an implementation of Callan's inference net approach. The Prob line is the logistic regression method described above. For title queries the described method performs slightly better than the CORI algorithm for up to about 100 collections, where CORI exceeds it. For long queries our method is virtually identical to CORI, and CORI performs better for very long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.

Application and Further Research

The method described here is being applied to two distributed systems of servers in the UK. The first, the Distributed Archives Hub, will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. The second, MerseyLibraries.org, is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index to provide efficient distributed search over the various servers. The basic model is shown in the server hierarchy figure above: individual database servers are harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials. We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection document statistics may provide a better ranking of results and thus more effective routing of queries.