Web Search – Summer Term 2006
VII. Selected Topics - Metasearch Engines [1]
(c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction
So far: General-purpose search engines (index and search the whole web)
In addition: Special-purpose search engines exist (index and search documents from a particular domain)
Problem:
- Both cover only a fraction of the web
- Users sometimes have to address several different search engines
Possible solution: Metasearch engines
Motivation
Why should we build metasearch engines?
- To increase the coverage of the web
- To address the scalability problem of searching the web
- To facilitate the invocation of multiple search engines
- To improve the retrieval effectiveness
A typical metasearch engine session
1. User sends query to metasearch engine
2. Metasearch engine processes query and sends it to component search engines
3. Component search engines return results (based on local similarity) to metasearch engine
4. Metasearch engine merges and ranks returned results (based on global similarities) and presents them to the user
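The four steps above can be sketched as a small pipeline. Everything below (engine names, result lists, the naive merge) is invented for illustration:

```python
# Toy sketch of a metasearch session; engines and scores are invented,
# and the naive merge stands in for the real result-merging step.

def metasearch(query, engines, merge):
    # Step 2: dispatch the query to every component search engine.
    local_results = {name: search(query) for name, search in engines.items()}
    # Steps 3/4: the engines' locally ranked results are merged globally.
    return merge(local_results)

# Component engines return (document, local similarity) pairs.
engines = {
    "A": lambda q: [("a1", 0.9), ("a2", 0.4)],
    "B": lambda q: [("b1", 0.7)],
}

def naive_merge(local_results):
    # Naively sort by the (generally incomparable!) local similarities.
    merged = [pair for docs in local_results.values() for pair in docs]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

print(metasearch("web search", engines, naive_merge))
# [('a1', 0.9), ('b1', 0.7), ('a2', 0.4)]
```

The merge step is exactly where the hard problems below arise: local similarities from different engines are generally not comparable.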
A typical architecture
[Figure: the USER INTERFACE passes the query via the DATABASE SELECTOR (database selection problem), the DOCUMENT SELECTOR (document selection problem), and the QUERY DISPATCHER to the component SEARCH ENGINEs; their results come back through the RESULT MERGER (result merging problem)]
Challenges for metasearch engines
Why is it hard to create metasearch engines?
Because of the heterogeneity among the involved autonomous component search engines:
- Various indexing methods
- Different document term weighting schemes
- Different query term weighting schemes
- Varying similarity functions
- Different document databases
- Unequal document versions
- Different result representations
The database selection problem
Selection is important if the number of component search engines is large
Goal: Identify as many potentially useful databases as possible while minimizing the number of useless databases wrongly identified as useful
Decision is usually based on a representative of the database
Most important questions:
- What are good representatives?
- How do we get them?
Approaches for database selection
Three types of techniques:
- Rough representative approaches
e.g. several key words or paragraphs
- Statistical representative approaches
statistical information (e.g. document frequency of a term)
- Learning-based approaches
learned from past experience (training queries or real (previous) user queries)
Rough representative approaches
Typically several keywords and paragraphs
Only give a rather general idea about the database
Often manually generated
Automatic approaches exist (e.g. taking the text from the interface page or anchor text from pages pointing to the search engine)
Alternatively: Involve the user, e.g. by explicit search engine selection or by asking the user to specify a subject area
Advantages: Easy to implement, requires little storage space; works well for highly specialized component search engines with diversified topics
Disadvantage: Does not work well for databases with documents of diverse interests
Statistical representative approaches
Examples:
- D-WISE approach
- CORI Net approach
- gGlOSS approach
- Estimating the number of potentially useful documents
- Estimating the similarity of the most similar documents
The D-WISE approach
The representative of a component search engine contains
- the document frequency of each term in the component search engine (n values dij)
- the number of documents in the database (1 value ni)
Ranking the component search engines based on a given query q:
1. Calculate the cue validity CVij of each search term tj:

CVij = (dij / ni) / (dij / ni + (Sum[k != i] dkj) / (Sum[k != i] nk))
The D-WISE approach (cont.)
2. Calculate the variance CVVj of the CVij's of each query term tj over all component databases:

CVVj = Sum[i] (CVij - ACVj)^2 / N

(ACVj = average of all CVij's over all component databases)
3. Compute the ranking score ri for each component database i with respect to q:

ri = Sum[j] CVVj * dij
The D-WISE approach (cont.)
4. Select component databases with the highest ranking score
(Intuitive interpretation of the ranking score: Indicates where useful query terms are concentrated)
Advantages of this approach:
- easily scalable
- easy to compute
Disadvantages:
- ranking scores are relative scores (no absolute quality measures)
- does not distinguish between multiple appearances of a term within a single document
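The three steps can be sketched as follows. The formulas are a hedged reconstruction following the cited survey [1], and the toy document frequencies are invented:

```python
# Hedged sketch of the D-WISE ranking; the CV/CVV formulas are reconstructed
# from the survey [1], and the toy document frequencies are invented.

def dwise_scores(df, n):
    """df[i][j] = doc. frequency of query term j in database i; n[i] = size of database i."""
    N, k = len(df), len(df[0])
    # 1. cue validity CV[i][j] of term j for database i
    CV = [[0.0] * k for _ in range(N)]
    for i in range(N):
        for j in range(k):
            local = df[i][j] / n[i]
            others = sum(df[x][j] for x in range(N) if x != i) / \
                     sum(n[x] for x in range(N) if x != i)
            CV[i][j] = local / (local + others)
    # 2. cue validity variance CVV[j] across all databases
    CVV = []
    for j in range(k):
        acv = sum(CV[i][j] for i in range(N)) / N            # ACV_j
        CVV.append(sum((CV[i][j] - acv) ** 2 for i in range(N)) / N)
    # 3. ranking score r[i] = sum_j CVV[j] * df[i][j]
    return [sum(CVV[j] * df[i][j] for j in range(k)) for i in range(N)]

# Three databases, two query terms: term 2 is concentrated in database 2.
scores = dwise_scores(df=[[10, 0], [2, 5], [1, 1]], n=[100, 50, 20])
best = max(range(len(scores)), key=lambda i: scores[i])      # top-ranked database
```

In the toy data, the second term's concentration in the second database gives that database the highest ranking score.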
Learning-based approaches
Predict usefulness of a database for new queries based on the experiences with the database from past ones
How to obtain these retrieval experiences?
- Use training queries (static learning)
- Use real user queries, accumulate retrieval knowledge gradually, and update it continuously (dynamic learning)
- Use a combined learning approach, i.e. start with training queries and update continuously based on real user queries
Examples:
- The MRDD approach
- The SavvySearch approach
- The ProFusion approach
The MRDD approach
Static learning approach, i.e. a set of training queries is given and all relevant documents for each training query have to be identified (manually)
Stores a vector reflecting the distribution of the relevant documents: <r1, r2, ..., rs>
with ri = number of top-ranked documents that must be retrieved to obtain i relevant documents
Example:
Retrieval result: (d1, ..., d100)
Relevant documents: d1, d4, d10, d17, d30
Distribution vector: <r1, r2, r3, r4, r5> = <1, 4, 10, 17, 30>
The MRDD approach (cont.)
Application:
Input: query q from the user
1. Compare q to the training queries and select the k most similar ones (e.g. k=8)
2. Calculate the average distribution vector over these k queries for each database
3. Select the databases Di that maximize precision
Example:
D1: <1, 4, 6, 7, 10, 12, 17>
D2: <3, 5, 7, 9, 15, 20>
D3: <2, 3, 6, 9, 11, 16>
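One way to read the example: to obtain a desired number of relevant documents, decide per database how many relevant documents to take (the distribution vectors say how many documents must be retrieved for that) so that overall precision is maximized. A brute-force sketch, with the invented goal of 4 relevant documents:

```python
from itertools import product

# Brute-force sketch of MRDD-style selection; the distribution vectors are the
# ones from the example above, the target of 4 relevant documents is invented.

def best_allocation(vectors, want_relevant):
    """Pick per-database relevant-doc counts that maximize precision.

    vectors[i][c-1] = number of documents that must be retrieved from
    database i to obtain c relevant documents."""
    best = None
    for counts in product(*(range(len(v) + 1) for v in vectors)):
        if sum(counts) != want_relevant:
            continue
        retrieved = sum(v[c - 1] for v, c in zip(vectors, counts) if c > 0)
        precision = want_relevant / retrieved
        if best is None or precision > best[0]:
            best = (precision, counts)
    return best

precision, counts = best_allocation(
    [[1, 4, 6, 7, 10, 12, 17], [3, 5, 7, 9, 15, 20], [2, 3, 6, 9, 11, 16]],
    want_relevant=4,
)
```

For these vectors no allocation retrieves fewer than 7 documents for 4 relevant ones, so the best achievable precision is 4/7.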
The document selection problem
The naive approach (i.e. return all results) won't work because it delivers too many documents
Goal: Retrieve as many potentially useful documents from a component database as possible while minimizing the retrieval of useless documents
Different categories of approaches:
- User determination
- Weighted allocation
- Learning-based approaches
- Guaranteed retrieval
User determination
Let the user select a number for each component database (or use a default number, if none is given by the user)
Advantage:
- Works well with small numbers of component databases and if the user is reasonably familiar with them
Disadvantage:
- Does not work for larger sets of component databases
Weighted allocation
Use the rank (or ranking score) calculated by the database selection algorithm to specify the number of documents that should be selected from the respective database
Example:
If ri is the ranking score of database Di (i = 1, ..., N) and m documents should be returned to the user, then select m * ri / (Sum[k] rk) documents from database Di
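A minimal sketch of score-proportional weighted allocation (the scores and the rounding policy are invented for illustration; rounding can make the shares miss m by a document or two):

```python
# Minimal sketch of weighted allocation: each database receives a share of the
# m requested documents proportional to its ranking score (toy scores invented).

def allocate(scores, m):
    total = sum(scores.values())
    # round() keeps the shares integral; in general they may not sum to m exactly.
    return {db: round(m * s / total) for db, s in scores.items()}

print(allocate({"D1": 0.6, "D2": 0.3, "D3": 0.1}, m=20))
# {'D1': 12, 'D2': 6, 'D3': 2}
```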
Learning-based approaches
Idea: Learn how many documents to retrieve for a given query based on past retrieval experiences for similar queries
Example: The MRDD approach (see before) is a learning-based approach that already includes document selection!
Guaranteed retrieval
Idea: Take global similarities into account (not only local ones, e.g. component database dependent document frequencies, etc.)
Example approaches:
- Query modification
- Computing the tightest local threshold
Result merging approaches
Goal: Combine all returned results (sorted by local similarity) into one single result (sorted by global similarity)
Why is this a hard problem?
- Local similarities are not provided by all search engines
- The heterogeneity of the component search engines makes them hard (if not impossible) to compare
- Local similarities might differ significantly from global ones
- Should documents returned by multiple search engines be treated differently?
Result merging approaches (cont.)
Two types of approaches exist:
Local similarity adjustment, e.g.
- Adjust local similarities using additional information (e.g. quality of the component database)
- Convert local document ranks to similarities
Global similarity estimation:
- Attempts to estimate the true global similarities
Local similarity adjustment
Distinguish three cases:
1. The selected databases (or returned results) are pairwise disjoint (or nearly disjoint)
2. The selected databases (or returned results) overlap but are not identical
3. The selected databases are identical
The latter case is known as data fusion and normally not considered for metasearch engines
In the following: Assume case 1 (disjoint results)
Local similarity adjustment
1st: assume local similarities exist
1. Normalize the returned similarities
2. Use database scores to adjust the local similarities (i.e. give higher preference to documents from highly rated databases), e.g.
If s is the ranking score of database D and s̄ is the average of all these scores, then a weight w for database D is computed from s, s̄, and N (N = no. of databases), giving above-average databases a weight w > 1.
Define the adjusted similarity as w*x (x = original, returned local similarity)
3. Sort results by adjusted similarity
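The three steps can be sketched as below. The concrete weight formula w = 1 + N*(s - s̄)/Sum(s) is an invented plausible choice with the stated property (above-average databases get w > 1), not necessarily the exact formula from the slides:

```python
# Hedged sketch of local similarity adjustment; the weight
# w = 1 + N * (s - s_mean) / sum(s) is an assumed plausible choice
# (above-average databases get w > 1), not necessarily the slide's formula.

def merge_adjusted(results, db_scores):
    """results[db] = [(doc, normalized local similarity)]; db_scores[db] = database score."""
    N = len(db_scores)
    total = sum(db_scores.values())
    mean = total / N
    merged = []
    for db, docs in results.items():
        w = 1 + N * (db_scores[db] - mean) / total    # step 2: database weight
        merged.extend((doc, w * sim) for doc, sim in docs)
    # step 3: sort by adjusted similarity
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

merged = merge_adjusted(
    {"D1": [("a", 0.8), ("b", 0.5)], "D2": [("c", 0.7)]},
    {"D1": 0.9, "D2": 0.3},
)
# "c" drops below "b": its database is rated far below average.
```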
Local similarity adjustment
2nd: assume local similarities do NOT exist
Possible approaches:
- Use the local document rank information directly to perform the merge, e.g.
1. Arrange databases based on their scores
2. Select documents with the round-robin method
(Alternatively: randomized version where the documents are selected based on the probability of the database being relevant)
- Convert the local document ranks to similarities
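The rank-based round-robin merge can be sketched as follows (databases already ordered by score, best first; the document names are invented):

```python
# Minimal round-robin merge when no local similarities are returned:
# visit the databases in score order and take one document per pass.

def round_robin(ranked_results):
    """ranked_results: per-database ranked document lists, best database first."""
    merged, depth = [], 0
    while any(depth < len(r) for r in ranked_results):
        for r in ranked_results:
            if depth < len(r):
                merged.append(r[depth])
        depth += 1
    return merged

print(round_robin([["a1", "a2"], ["b1", "b2", "b3"]]))
# ['a1', 'b1', 'a2', 'b2', 'b3']
```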
Local similarity adjustment (cont.)
Now consider case 2: the selected databases (or returned results) overlap but are not identical
How to deal with results returned by multiple search engines?
Calculate adjusted similarities as before and use (e.g.) methods from data fusion to combine them. However, this might not work well because of the different coverage of the component databases (-> active research field).
Global similarity estimation
Approaches to estimate global similarities:
1. Document fetching:
Idea: Download the returned documents to obtain (e.g.) term frequencies and to estimate global document frequencies
Disadvantage: expensive (but remedies exist, e.g. download documents in parallel, keep downloading and analyzing documents while initial results are presented, or only download the beginning portion of each document)
Advantages:
- Identifies obsolete URLs
- Ranking based on current (up-to-date) content
- Better result representation
Global similarity estimation
Approaches to estimate global similarities (cont.):
2. Use of discovered knowledge:
Basic idea: Try to figure out the specific document indexing and similarity computation methods used by the different component search engines. Use this information to
- better compare local similarities
- better adjust local similarities to make them comparable with each other
- better derive global similarities from local ones
Global similarity estimation
2. Use of discovered knowledge: Examples
Assume all component search engines use the same indexing and local similarity computation methods, which do not include collection-dependent statistics -> local similarities are comparable and can be used directly
Assume the only difference is the usage of (different) lists of stopwords -> modify the query to make results comparable
Assume idf information is also used -> either adjust local similarities or compute global similarities directly
Global similarity estimation
2. Use of discovered knowledge: Examples (cont.)
Assume idf information is also used -> either adjust local similarities or compute global similarities directly
Case 1: Query q contains just one single term t
Similarity in the component database: computed with the local idf of t
Global similarity: multiply the local similarity by the ratio of the global idf to the local idf of t
Case 2: Query q contains multiple terms: see the literature [1]
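For the single-term case, the adjustment amounts to rescaling by the idf ratio. A sketch assuming a plain log(n/df) idf (the exact idf variant is an assumption, and the numbers are invented):

```python
import math

# Hedged single-term example: if a component engine computed the local
# similarity with its local idf of t, rescale by global_idf / local_idf
# to estimate the global similarity. Assumes a plain log(n/df) idf.

def global_sim(local_sim, df_local, n_local, df_global, n_global):
    local_idf = math.log(n_local / df_local)
    global_idf = math.log(n_global / df_global)
    return local_sim * global_idf / local_idf

# Toy numbers: t is relatively rarer globally than locally, so the similarity grows.
print(round(global_sim(0.5, df_local=10, n_local=100, df_global=100, n_global=10000), 3))
# 1.0
```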
Other challenges for metasearch engines
1. Integrate local systems employing different indexing techniques
2. Integrate local systems supporting different types of queries (e.g. Boolean vs. vector space queries)
3. Discover knowledge about component search engines
4. Develop more effective result merging methods
5. Study the appropriate cooperation between a metasearch engine and the local systems
6. Incorporate new indexing and weighting techniques to build better metasearch engines
Other challenges for metasearch engines
7. Improve the effectiveness of metasearch
8. Decide where to place the software components of a metasearch engine
9. Create a standard testbed to evaluate the proposed techniques for database selection, document selection, and result merging
10. Extend metasearch techniques to different types of data sources
References
[1] W. Meng, C. Yu, K.-L. Liu: Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002.
Recap: IR System & Tasks Involved
[Figure: IR system architecture - the user's INFORMATION NEED is turned into a QUERY by query processing (parsing & term processing), yielding the logical view of the information need; on the document side, data is selected for indexing and parsed/term-processed; SEARCHING and RANKING produce the RESULTS, followed by result representation and performance evaluation]
General web search engine architecture
[Figure: CLIENT sends QUERIES to the QUERY ENGINE and receives RESULTS via the RANKING module; CRAWLER(S) under CRAWL CONTROL fetch pages from the WWW into the PAGE REPOSITORY; the INDEXER MODULE and COLLECTION ANALYSIS MODULE build the text, structure, and utility INDEXES; USAGE FEEDBACK flows back into crawl control]
(cf. Fig. 1 in Arasu et al.: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001, page 4)
Next week
1. Final grading of the programming exercises
2. Exercises with the PageRank simulation tool (participation mandatory, but no grading)
Exam dates
August 28th and 29th OR
September 18th and 19th OR
by personal arrangement