Query Suggestion. A variety of automatic or semi-automatic query suggestion techniques have been...
2
Query Suggestion
A variety of automatic or semi-automatic query suggestion techniques have been developed
The goal is to improve effectiveness by matching related/similar terms
Semi-automatic techniques require user interaction to select the best suggested terms
Query expansion is a related technique
It suggests alternative queries, usually with more terms
3
Query Suggestion
Approaches are usually based on an analysis of term co-occurrence
Either in the entire document collection, a large collection of queries, or the top-ranked documents in a result list
Query-based stemming is also a suggestion technique
Automatic suggestion based on a general thesaurus is not effective
It does not take context into account, e.g.,
“aquarium” is a good suggestion for “tank” in the query “tropical fish tank”, but not in “armor for tanks”
4
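Term Association Measures
Dice’s Coefficient
Dice(a, b) = 2 · nab / (na + nb) =rank nab / (na + nb)
where =rank stands for rank equivalent, na and nb are the numbers of documents containing words a and b, and nab is the number of documents containing both
Mutual Information Measure (MIM)
MIM(a, b) = log [P(a, b) / (P(a) · P(b))] = log [N · nab / (na · nb)] =rank nab / (na · nb)
where N is the number of documents in a collection
P(a) = na/N, P(b) = nb/N, P(a, b) = nab/N
Measures the extent to which the words occur independently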
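The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming document-level co-occurrence counts and the rank-equivalent forms nab / (na + nb) and nab / (na · nb); the function names and the four toy documents are hypothetical and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count documents per term (n) and per unordered term pair (n_pair)."""
    n = Counter()
    n_pair = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))       # each term counted once per document
        n.update(terms)
        n_pair.update(combinations(terms, 2))  # pairs are stored in sorted order
    return n, n_pair

def dice(a, b, n, n_pair):
    """Rank-equivalent Dice: nab / (na + nb). Assumes a and b occur in the collection."""
    nab = n_pair[tuple(sorted((a, b)))]
    return nab / (n[a] + n[b])

def mim(a, b, n, n_pair):
    """Rank-equivalent MIM: nab / (na * nb). Assumes a and b occur in the collection."""
    nab = n_pair[tuple(sorted((a, b)))]
    return nab / (n[a] * n[b])

# Hypothetical toy collection, one string per document.
docs = [
    "tropical fish tank",
    "tropical fish aquarium",
    "fish market",
    "armor tank",
]
n, n_pair = cooccurrence_counts(docs)

for other in ("fish", "aquarium"):
    print(other, dice("tropical", other, n, n_pair), mim("tropical", other, n, n_pair))
```

On this toy data Dice scores "fish" above "aquarium" for "tropical", while MIM reverses the order because "aquarium" occurs in only one document.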
5
Term Association Measures
The Mutual Information Measure (MIM) favors low-frequency terms
The Expected Mutual Information Measure (EMIM) addresses this problem by weighting MIM using P(a, b)
This is actually only one part of the full EMIM, the part focused on word occurrence
EMIM, however, favors high-frequency terms
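A small numeric sketch of the frequency bias, assuming the rank-equivalent forms nab / (na · nb) for MIM and nab · log(N · nab / (na · nb)) for EMIM. The counts are invented for illustration, and returning 0 when nab is 0 is a convention assumed here, not something stated on the slide.

```python
import math

def mim(n_a, n_b, n_ab):
    """Rank-equivalent MIM: nab / (na * nb)."""
    return n_ab / (n_a * n_b)

def emim(n_a, n_b, n_ab, n_docs):
    """Rank-equivalent EMIM: nab * log(N * nab / (na * nb))."""
    if n_ab == 0:
        return 0.0  # assumed convention: no co-occurrence contributes nothing
    return n_ab * math.log(n_docs * n_ab / (n_a * n_b))

N = 1000                      # hypothetical collection size
rare = (2, 2, 2)              # two rare words that always occur together
frequent = (200, 200, 100)    # two frequent words that often occur together

print("MIM :", mim(*rare), mim(*frequent))
print("EMIM:", emim(*rare, N), emim(*frequent, N))
```

With these counts MIM scores the rare pair higher, while EMIM scores the frequent pair higher, matching the two biases described above.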
6
Term Association Measures
Pearson’s Chi-squared (χ2) measure
Compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent
Normalizes this comparison by the expected number
χ2(a, b) = (nab − na · nb / N)^2 / (na · nb / N) =rank (nab − na · nb / N)^2 / (na · nb)
This is also a limited form, focused on word co-occurrence
Expected number of co-occurrences if the words occur independently: N · P(a) · P(b) = na · nb / N
Favors low-frequency terms
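The comparison described above can be sketched as (observed − expected)^2 / expected, with the expected count taken as na · nb / N. The function name and the counts are hypothetical and only illustrate the low-frequency bias.

```python
def chi_squared(n_a, n_b, n_ab, n_docs):
    """Limited-form chi-squared: squared gap between observed and expected
    co-occurrence counts, normalized by the expected count."""
    expected = n_a * n_b / n_docs   # co-occurrences expected under independence
    return (n_ab - expected) ** 2 / expected

N = 1000                      # hypothetical collection size
rare = (2, 2, 2)              # two rare words that always occur together
frequent = (200, 200, 100)    # two frequent words that often occur together

print(chi_squared(*rare, N))      # very large: the expected count is tiny
print(chi_squared(*frequent, N))  # much smaller despite 100 co-occurrences
```

Because the expected count for rare words is tiny, even a couple of co-occurrences produce a very large score.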
8
Association Measure Example
Most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Identical ranking & favor low-frequency words
More general than MIM & χ2
9
Association Measure Example
Most strongly associated words for “fish”, a high-frequency term, in a collection of TREC news stories.
Similar top-ranked words in MIM & χ2
10
Association Measure Example
Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
Still favor low-frequency terms
Most stable & reliable regardless of the window sizes
11
Association Measures
The words associated with the individual terms are of little use for expanding the query “tropical fish”
Expansion based on the whole query takes context into account
e.g., using Dice with the term “tropical fish” gives the following highly associated words:
goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet
Impractical for all possible queries; other approaches are used to achieve this effect
12
Other Approaches
Pseudo-relevance feedback
Expansion terms are based on the top retrieved docs for the initial query
Context vectors
Represent words by the words that co-occur with them
• e.g., top 35 most strongly associated words for “aquarium” (using Dice’s coefficient):
Rank words for a query by ranking context vectors
Challenges (computational & accuracy): due to the huge size & variability in quality of the collections
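Pseudo-relevance feedback can be sketched as follows: run the initial query, take the top-ranked documents, and pick the non-query terms that appear in most of them. This is a minimal illustration only. The overlap-based ranking function, the parameter names (top_k, n_terms), and the toy documents are all assumptions, and a real system would use its own retrieval model and term weighting.

```python
from collections import Counter

def rank_by_overlap(query, docs):
    """Hypothetical stand-in for a retrieval model: rank by number of shared terms."""
    q = set(query.split())
    scored = [(len(q & set(d.split())), i) for i, d in enumerate(docs)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [docs[i] for score, i in ranked if score > 0]

def prf_expansion_terms(query, docs, top_k=2, n_terms=3):
    """Return the non-query terms found in the most top-k retrieved documents."""
    q = set(query.split())
    counts = Counter()
    for doc in rank_by_overlap(query, docs)[:top_k]:
        counts.update(t for t in set(doc.split()) if t not in q)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]

# Hypothetical toy collection.
docs = [
    "tropical fish aquarium supplies",
    "tropical fish aquarium coral",
    "tropical storm warning",
    "armor tank battle",
]
print(prf_expansion_terms("tropical fish", docs))
```

Because the expansion terms come from documents retrieved for the whole query, the context of "tropical fish" is taken into account without computing associations for every possible query in advance.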
13
Other Approaches
Query logs
Best source of information about queries & related terms
• short pieces of text & click data
e.g., most frequent words in queries containing “tropical fish” from MSN log:
stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies
Query suggestion based on finding similar queries
• group based on click data
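The "most frequent words in queries containing a phrase" idea can be sketched as below. The log entries are invented for illustration (they are not from the MSN log), and the function name and tie-breaking rule are assumptions.

```python
from collections import Counter

def related_words(phrase, query_log, n_terms=5):
    """Most frequent words in logged queries that contain the phrase."""
    phrase_terms = phrase.lower().split()
    needle = " " + " ".join(phrase_terms) + " "
    counts = Counter()
    for query in query_log:
        terms = query.lower().split()
        # Padding with spaces makes the substring test match whole words only.
        if needle in " " + " ".join(terms) + " ":
            counts.update(t for t in terms if t not in phrase_terms)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]

# Hypothetical query log.
query_log = [
    "tropical fish stores",
    "tropical fish pictures",
    "live tropical fish",
    "tropical fish stores online",
    "tropical storm",
    "fish tank",
]
print(related_words("tropical fish", query_log, n_terms=2))
```

Grouping queries by click data would need the clicked URLs as well, which this sketch does not model.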
14
Query Expansion
Search engines suggest expanded/alternative queries in response to a query Q
Using some form of thesaurus to perform global analysis
• For each term t in Q, Q is expanded with synonyms and related words of t from the thesaurus
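The per-term expansion step can be sketched as a dictionary lookup. The one-entry thesaurus below is a hypothetical example built from the “tank” / “aquarium” case mentioned earlier; a real thesaurus would be far larger.

```python
def expand_query(query, thesaurus):
    """Append the thesaurus entries of every query term, skipping duplicates."""
    terms = query.split()
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return " ".join(expanded)

# Hypothetical one-entry thesaurus.
thesaurus = {"tank": ["aquarium"]}

print(expand_query("tropical fish tank", thesaurus))
print(expand_query("armor tank", thesaurus))
```

The second call shows the weakness of term-by-term lookup: "aquarium" is added to "armor tank" because the rest of the query is never consulted.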
15
Query Expansion
Methods for building a thesaurus for query expansion
1. Use of a controlled vocabulary maintained by human editors, such as the Library of Congress subject headings (LCSH), e.g.,
• The LCSH of “American Revolutionary War” is
United States -- History -- Revolution, 1775-1783
2. An automatically derived thesaurus, constructed using word co-occurrence statistics over a collection of docs
3. Query reformulations based on query log mining, exploring the manual query reformulations of other users to make suggestions to a user
Thesaurus-based query expansion does not require any user input to increase recall
16
Query Expansion
Automatic thesaurus generation using word co-occurrence
A simple approach is based on term-term similarities
• Start with a term-document matrix A, where each cell A_{t,d} is a weighted count w_{t,d} for term t & document d
• Calculate C = AA^T, in which C_{u,v} is a similarity score between terms u and v; the larger the number, the better
• An example of a derived thesaurus with good/bad suggestions
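The matrix product in the steps above can be sketched in plain Python. The matrix uses binary weights for simplicity, and the four terms, four documents, and helper names are hypothetical.

```python
def term_term_similarity(A):
    """Compute C = A A^T for a term-document matrix A given as a list of rows."""
    n_terms = len(A)
    return [
        [sum(x * y for x, y in zip(A[u], A[v])) for v in range(n_terms)]
        for u in range(n_terms)
    ]

def nearest_terms(term, terms, C):
    """Other terms sorted by decreasing similarity to `term`."""
    u = terms.index(term)
    others = [(C[u][v], terms[v]) for v in range(len(terms)) if v != u]
    return [name for score, name in sorted(others, key=lambda s: (-s[0], s[1]))]

# Rows are terms, columns are four hypothetical documents (1 = term occurs).
terms = ["tropical", "fish", "aquarium", "armor"]
A = [
    [1, 1, 0, 0],  # tropical
    [1, 1, 1, 0],  # fish
    [0, 1, 0, 0],  # aquarium
    [0, 0, 0, 1],  # armor
]
C = term_term_similarity(A)
print(nearest_terms("tropical", terms, C))
```

With raw counts, frequent terms tend to receive high scores with everything, so the rows of A are often length-normalized before taking the product; that step is omitted here.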
17
Query Expansion
The quality of term association is typically a problem in an automatically generated thesaurus
Term ambiguity easily introduces irrelevant statistically correlated terms, e.g., “Apple” can be expanded to “Apple red fruit computer”
• Suffers from false positives (FP) and false negatives (FN)
High cost to manually produce and update a thesaurus
Query expansion often increases recall, but may also significantly decrease precision, especially when the query contains ambiguous terms, e.g.,
expanding “interest rate” to “interest rate fascinate evaluate” is unlikely to be useful