Web Search – Summer Term 2006
VII. Selected Topics - Metasearch Engines [1]
(c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction
So far: General-purpose search engines (index and search the whole web)
In addition: Special-purpose search engines exist (index and search documents from a particular domain)
Problem:
- Both cover only a fraction of the web
- Users sometimes have to address several different search engines
Possible solution: Metasearch engines
Motivation
Why should we build metasearch engines?
- To increase the coverage of the web
- To address the scalability problem of searching the web
- To facilitate the invocation of multiple search engines
- To improve the retrieval effectiveness
A typical metasearch engine session
1. User sends query to metasearch engine
2. Metasearch engine processes query and sends it to component search engines
3. Component search engines return results (based on local similarity) to metasearch engine
4. Metasearch engine merges and ranks returned results (based on global similarities) and presents them to the user
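The four steps above can be sketched as a small pipeline. Everything below (engine names, result lists, the naive merge) is invented for illustration:

```python
# Toy sketch of a metasearch session; engines and scores are invented,
# and the naive merge stands in for the real result-merging step.

def metasearch(query, engines, merge):
    # Step 2: dispatch the query to every component search engine.
    local_results = {name: search(query) for name, search in engines.items()}
    # Steps 3/4: the engines' locally ranked results are merged globally.
    return merge(local_results)

# Component engines return (document, local similarity) pairs.
engines = {
    "A": lambda q: [("a1", 0.9), ("a2", 0.4)],
    "B": lambda q: [("b1", 0.7)],
}

def naive_merge(local_results):
    # Naively sort by the (generally incomparable!) local similarities.
    merged = [pair for docs in local_results.values() for pair in docs]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

print(metasearch("web search", engines, naive_merge))
# [('a1', 0.9), ('b1', 0.7), ('a2', 0.4)]
```

The merge step is exactly where the hard problems below arise: local similarities from different engines are generally not comparable.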
A typical architecture
[Figure: the USER INTERFACE passes the query via the DATABASE SELECTOR (database selection problem), the DOCUMENT SELECTOR (document selection problem), and the QUERY DISPATCHER to the component SEARCH ENGINEs; their results come back through the RESULT MERGER (result merging problem)]
Challenges for metasearch engines
Why is it hard to create metasearch engines?
Because of the heterogeneity among the involved autonomous component search engines:
- Various indexing methods
- Different document term weighting schemes
- Different query term weighting schemes
- Varying similarity functions
- Different document databases
- Unequal document versions
- Different result representations
The database selection problem
Selection is important if the number of component search engines is large
Goal: Identify as many potentially useful databases as possible while minimizing the number of useless databases wrongly identified as useful
Decision is usually based on a representative of the database
Most important questions:
- What are good representatives?
- How do we get them?
Approaches for database selection
Three types of techniques:
- Rough representative approaches
e.g. several key words or paragraphs
- Statistical representative approaches
statistical information (e.g. document frequency of a term)
- Learning-based approaches
learned from past experience (training queries or real (previous) user queries)
Rough representative approaches
Typically several keywords and paragraphs
Only give a rather general idea about the database
Often manually generated
Automatic approaches exist (e.g. taking the text from the interface page or anchor text from pages pointing to the search engine)
Alternatively: Involve the user, e.g. by explicit search engine selection or by asking the user to specify a subject area
Advantages: Easy to implement, requires little storage space; works well for highly specialized component search engines with diversified topics
Disadvantage: Does not work well for databases with documents of diverse interests
Statistical representative approaches
Examples:
- D-WISE approach
- CORI Net approach
- gGlOSS approach
- Estimating the number of potentially useful documents
- Estimating the similarity of the most similar documents
The D-WISE approach
The representative of a component search engine contains
- the document frequency of each term in the component search engine (n values dij)
- the number of documents in the database (1 value ni)
Ranking the component search engines based on a given query q:
1. Calculate the cue validity CVij of each search term tj:

CVij = (dij / ni) / (dij / ni + (Sum[k != i] dkj) / (Sum[k != i] nk))
The D-WISE approach (cont.)
2. Calculate the variance CVVj of the CVij's of each query term tj over all component databases:

CVVj = Sum[i] (CVij - ACVj)^2 / N

(ACVj = average of all CVij's over all component databases)
3. Compute the ranking score ri for each component database i with respect to q:

ri = Sum[j] CVVj * dij
The D-WISE approach (cont.)
4. Select component databases with the highest ranking score
(Intuitive interpretation of the ranking score: Indicates where useful query terms are concentrated)
Advantages of this approach:
- easily scalable
- easy to compute
Disadvantages:
- ranking scores are relative scores (no absolute quality measures)
- does not distinguish between multiple appearances of a term within a single document
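The three steps can be sketched as follows. The formulas are a hedged reconstruction following the cited survey [1], and the toy document frequencies are invented:

```python
# Hedged sketch of the D-WISE ranking; the CV/CVV formulas are reconstructed
# from the survey [1], and the toy document frequencies are invented.

def dwise_scores(df, n):
    """df[i][j] = doc. frequency of query term j in database i; n[i] = size of database i."""
    N, k = len(df), len(df[0])
    # 1. cue validity CV[i][j] of term j for database i
    CV = [[0.0] * k for _ in range(N)]
    for i in range(N):
        for j in range(k):
            local = df[i][j] / n[i]
            others = sum(df[x][j] for x in range(N) if x != i) / \
                     sum(n[x] for x in range(N) if x != i)
            CV[i][j] = local / (local + others)
    # 2. cue validity variance CVV[j] across all databases
    CVV = []
    for j in range(k):
        acv = sum(CV[i][j] for i in range(N)) / N            # ACV_j
        CVV.append(sum((CV[i][j] - acv) ** 2 for i in range(N)) / N)
    # 3. ranking score r[i] = sum_j CVV[j] * df[i][j]
    return [sum(CVV[j] * df[i][j] for j in range(k)) for i in range(N)]

# Three databases, two query terms: term 2 is concentrated in database 2.
scores = dwise_scores(df=[[10, 0], [2, 5], [1, 1]], n=[100, 50, 20])
best = max(range(len(scores)), key=lambda i: scores[i])      # top-ranked database
```

In the toy data, the second term's concentration in the second database gives that database the highest ranking score.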
Learning-based approaches
Predict usefulness of a database for new queries based on the experiences with the database from past ones
How to obtain these retrieval experiences?
- Use training queries (static learning)
- Use real user queries, accumulate retrieval knowledge gradually, and update it continuously (dynamic learning)
- Use a combined learning approach, i.e. start with training queries and update continuously based on real user queries
Examples:
- The MRDD approach
- The SavvySearch approach
- The ProFusion approach
The MRDD approach
Static learning approach, i.e. a set of training queries is given and all relevant documents for each training query have to be identified (manually)
Stores a vector reflecting the distribution of the relevant documents: <r1, r2, ..., rs>
with ri = number of top-ranked documents that must be retrieved to obtain i relevant documents
Example:
Retrieval result: (d1, ..., d100)
Relevant documents: d1, d4, d10, d17, d30
Distribution vector: <r1, r2, r3, r4, r5> = <1, 4, 10, 17, 30>
The MRDD approach (cont.)
Application:
Input: query q from the user
1. Compare q to the training queries and select the k most similar ones (e.g. k=8)
2. Calculate the average distribution vector over these k queries for each database
3. Select the databases Di that maximize precision
Example:
D1: <1, 4, 6, 7, 10, 12, 17>
D2: <3, 5, 7, 9, 15, 20>
D3: <2, 3, 6, 9, 11, 16>
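One way to read the example: to obtain a desired number of relevant documents, decide per database how many relevant documents to take (the distribution vectors say how many documents must be retrieved for that) so that overall precision is maximized. A brute-force sketch, with the invented goal of 4 relevant documents:

```python
from itertools import product

# Brute-force sketch of MRDD-style selection; the distribution vectors are the
# ones from the example above, the target of 4 relevant documents is invented.

def best_allocation(vectors, want_relevant):
    """Pick per-database relevant-doc counts that maximize precision.

    vectors[i][c-1] = number of documents that must be retrieved from
    database i to obtain c relevant documents."""
    best = None
    for counts in product(*(range(len(v) + 1) for v in vectors)):
        if sum(counts) != want_relevant:
            continue
        retrieved = sum(v[c - 1] for v, c in zip(vectors, counts) if c > 0)
        precision = want_relevant / retrieved
        if best is None or precision > best[0]:
            best = (precision, counts)
    return best

precision, counts = best_allocation(
    [[1, 4, 6, 7, 10, 12, 17], [3, 5, 7, 9, 15, 20], [2, 3, 6, 9, 11, 16]],
    want_relevant=4,
)
```

For these vectors no allocation retrieves fewer than 7 documents for 4 relevant ones, so the best achievable precision is 4/7.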
The document selection problem
The naive approach (i.e. return all results) won't work because it delivers too many documents
Goal: Retrieve as many potentially useful documents from a component database as possible while minimizing the retrieval of useless documents
Different categories of approaches:
- User determination
- Weighted allocation
- Learning-based approaches
- Guaranteed retrieval
User determination
Let the user select a number for each component database (or use a default number, if none is given by the user)
Advantage:
- Works well with small numbers of component databases and if the user is reasonably familiar with them
Disadvantage:
- Does not work for larger sets of component databases
Weighted allocation
Use the rank (or ranking score) calculated by the database selection algorithm to specify the number of documents that should be selected from the respective database
Example:
If ri is the ranking score of database Di (i = 1, ..., N) and m documents should be returned to the user, then select m * ri / (Sum[k] rk) documents from database Di
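A minimal sketch of score-proportional weighted allocation (the scores and the rounding policy are invented for illustration; rounding can make the shares miss m by a document or two):

```python
# Minimal sketch of weighted allocation: each database receives a share of the
# m requested documents proportional to its ranking score (toy scores invented).

def allocate(scores, m):
    total = sum(scores.values())
    # round() keeps the shares integral; in general they may not sum to m exactly.
    return {db: round(m * s / total) for db, s in scores.items()}

print(allocate({"D1": 0.6, "D2": 0.3, "D3": 0.1}, m=20))
# {'D1': 12, 'D2': 6, 'D3': 2}
```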
Learning-based approaches
Idea: Learn how many documents to retrieve for a given query based on past retrieval experiences for similar queries
Example: The MRDD approach (see before) is a learning-based approach that already includes document selection!
Guaranteed retrieval
Idea: Take global similarities into account (not only local ones, e.g. component database dependent document frequencies, etc.)
Example approaches:
- Query modification
- Computing the tightest local threshold
Result merging approaches
Goal: Combine all returned results (sorted by local similarity) into one single result (sorted by global similarity)
Why is this a hard problem?
- Local similarities are not provided by all search engines
- The heterogeneity of the component search engines makes them hard (if not impossible) to compare
- Local similarities might differ significantly from global ones
- Should documents returned by multiple search engines be treated differently?
Result merging approaches (cont.)
Two types of approaches exist:
Local similarity adjustment, e.g.
- Adjust local similarities using additional information (e.g. quality of the component database)
- Convert local document ranks to similarities
Global similarity estimation:
- Attempts to estimate the true global similarities
Local similarity adjustment
Distinguish three cases:
1. The selected databases (or returned results) are pairwise disjoint (or nearly disjoint)
2. The selected databases (or returned results) overlap but are not identical
3. The selected databases are identical
The latter case is known as data fusion and normally not considered for metasearch engines
In the following: Assume case 1 (disjoint results)
Local similarity adjustment
1st: assume local similarities exist
1. Normalize the returned similarities
2. Use database scores to adjust the local similarities (i.e. give higher preference to documents from highly rated databases), e.g.
If s is the ranking score of database D and s̄ is the average of all these scores, then a weight w for database D is computed from s, s̄, and N (N = no. of databases), giving above-average databases a weight w > 1.
Define the adjusted similarity as w*x (x = original, returned local similarity)
3. Sort results by adjusted similarity
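The three steps can be sketched as below. The concrete weight formula w = 1 + N*(s - s̄)/Sum(s) is an invented plausible choice with the stated property (above-average databases get w > 1), not necessarily the exact formula from the slides:

```python
# Hedged sketch of local similarity adjustment; the weight
# w = 1 + N * (s - s_mean) / sum(s) is an assumed plausible choice
# (above-average databases get w > 1), not necessarily the slide's formula.

def merge_adjusted(results, db_scores):
    """results[db] = [(doc, normalized local similarity)]; db_scores[db] = database score."""
    N = len(db_scores)
    total = sum(db_scores.values())
    mean = total / N
    merged = []
    for db, docs in results.items():
        w = 1 + N * (db_scores[db] - mean) / total    # step 2: database weight
        merged.extend((doc, w * sim) for doc, sim in docs)
    # step 3: sort by adjusted similarity
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

merged = merge_adjusted(
    {"D1": [("a", 0.8), ("b", 0.5)], "D2": [("c", 0.7)]},
    {"D1": 0.9, "D2": 0.3},
)
# "c" drops below "b": its database is rated far below average.
```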
Local similarity adjustment
2nd: assume local similarities do NOT exist
Possible approaches:
- Use the local document rank information directly to perform the merge, e.g.
1. Arrange databases based on their scores
2. Select documents with the round-robin method
(Alternatively: randomized version where the documents are selected based on the probability of the database being relevant)
- Convert the local document ranks to similarities
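The rank-based round-robin merge can be sketched as follows (databases already ordered by score, best first; the document names are invented):

```python
# Minimal round-robin merge when no local similarities are returned:
# visit the databases in score order and take one document per pass.

def round_robin(ranked_results):
    """ranked_results: per-database ranked document lists, best database first."""
    merged, depth = [], 0
    while any(depth < len(r) for r in ranked_results):
        for r in ranked_results:
            if depth < len(r):
                merged.append(r[depth])
        depth += 1
    return merged

print(round_robin([["a1", "a2"], ["b1", "b2", "b3"]]))
# ['a1', 'b1', 'a2', 'b2', 'b3']
```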
Local similarity adjustment (cont.)
Now consider case 2: the selected databases (or returned results) overlap but are not identical
How to deal with results returned by multiple search engines?
Calculate adjusted similarities as before and use (e.g.) methods from data fusion to combine them. However, this might not work well because of the different coverage of the component databases (-> active research field).
Global similarity estimation
Approaches to estimate global similarities:
1. Document fetching:
Idea: Download the returned documents to obtain (e.g.) term frequencies and to estimate global document frequencies
Disadvantage: expensive (but remedies exist, e.g. download documents in parallel, keep downloading and analyzing documents while initial results are presented, or only download the beginning portion of each document)
Advantages:
- Identifies obsolete URLs
- Ranking based on current (up-to-date) content
- Better result representation
Global similarity estimation
Approaches to estimate global similarities (cont.):
2. Use of discovered knowledge:
Basic idea: Try to figure out the specific document indexing and similarity computation methods used by the different component search engines. Use this information to
- better compare local similarities
- better adjust local similarities to make them comparable with each other
- better derive global similarities from local ones
Global similarity estimation
2. Use of discovered knowledge: Examples
Assume all component search engines use the same indexing and local similarity computation methods, which do not include collection-dependent statistics -> local similarities are comparable and can be used directly
Assume the only difference is the usage of (different) lists of stopwords -> modify the query to make results comparable
Assume idf information is also used -> either adjust local similarities or compute global similarities directly
Global similarity estimation
2. Use of discovered knowledge: Examples (cont.)
Assume idf information is also used -> either adjust local similarities or compute global similarities directly
Case 1: Query q contains just one single term t
Similarity in the component database: computed with the local idf of t
Global similarity: multiply the local similarity by the ratio of the global idf to the local idf of t
Case 2: Query q contains multiple terms: see the literature [1]
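For the single-term case, the adjustment amounts to rescaling by the idf ratio. A sketch assuming a plain log(n/df) idf (the exact idf variant is an assumption, and the numbers are invented):

```python
import math

# Hedged single-term example: if a component engine computed the local
# similarity with its local idf of t, rescale by global_idf / local_idf
# to estimate the global similarity. Assumes a plain log(n/df) idf.

def global_sim(local_sim, df_local, n_local, df_global, n_global):
    local_idf = math.log(n_local / df_local)
    global_idf = math.log(n_global / df_global)
    return local_sim * global_idf / local_idf

# Toy numbers: t is relatively rarer globally than locally, so the similarity grows.
print(round(global_sim(0.5, df_local=10, n_local=100, df_global=100, n_global=10000), 3))
# 1.0
```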
Other challenges for metasearch engines
1. Integrate local systems employing different indexing techniques
2. Integrate local systems supporting different types of queries (e.g. Boolean vs. vector space queries)
3. Discover knowledge about component search engines
4. Develop more effective result merging methods
5. Study the appropriate cooperation between a metasearch engine and the local systems
6. Incorporate new indexing and weighting techniques to build better metasearch engines
Other challenges for metasearch engines
7. Improve the effectiveness of metasearch
8. Decide where to place the software components of a metasearch engine
9. Create a standard testbed to evaluate the proposed techniques for database selection, document selection, and result merging
10. Extend metasearch techniques to different types of data sources
References
[1] W. Meng, C. Yu, K.-L. Liu: Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002.
Recap: IR System & Tasks Involved
[Figure: IR system architecture - the user's INFORMATION NEED is turned into a QUERY by query processing (parsing & term processing), yielding the logical view of the information need; on the document side, data is selected for indexing and parsed/term-processed; SEARCHING and RANKING produce the RESULTS, followed by result representation and performance evaluation]
General web search engine architecture
[Figure: CLIENT sends QUERIES to the QUERY ENGINE and receives RESULTS via the RANKING module; CRAWLER(S) under CRAWL CONTROL fetch pages from the WWW into the PAGE REPOSITORY; the INDEXER MODULE and COLLECTION ANALYSIS MODULE build the text, structure, and utility INDEXES; USAGE FEEDBACK flows back into crawl control]
(cf. Fig. 1 in Arasu et al.: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001, page 4)
Next week
1. Final grading of the programming exercises
2. Exercises with the PageRank simulation tool (participation mandatory, but no grading)
Exam dates
August 28th and 29th OR
September 18th and 19th OR
by personal arrangement