Query Suggestion

Query Suggestion

A variety of automatic or semi-automatic query suggestion techniques have been developed

The goal is to improve retrieval effectiveness by matching related/similar terms

Semi-automatic techniques require user interaction to select the best suggested terms

Query expansion is a related technique that produces alternative (expanded) queries, usually with more terms


Query Suggestion

Approaches are usually based on an analysis of term co-occurrence

Either in the entire document collection, in a large collection of queries, or in the top-ranked documents in a result list

Query-based stemming is also a suggestion technique

Automatic suggestion based on a general thesaurus is not effective

It does not take context into account, e.g., “aquarium” is a good suggestion for “tank” in the query “tropical fish tank”, but not in “armor for tanks”


Term Association Measures

Dice’s Coefficient

Dice(a, b) = 2 · nab / (na + nb)  =rank  nab / (na + nb)

where na and nb are the numbers of documents containing terms a and b, nab is the number of documents containing both, and =rank stands for rank equivalence (both forms produce the same ranking of term pairs)

Mutual Information Measure (MIM)

MIM(a, b) = log ( P(a, b) / (P(a) · P(b)) )  =rank  nab / (na · nb)

where N is the number of documents in the collection and P(a) = na/N, P(b) = nb/N, P(a, b) = nab/N

Measures the extent to which the two words co-occur, compared with what would be expected if they occurred independently (see the sketch below)
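A minimal sketch, assuming a hypothetical toy collection of tokenized documents (not from the slides), of how Dice’s coefficient and the rank-equivalent MIM can be computed from document-level co-occurrence counts:

from collections import defaultdict
from itertools import combinations

# Hypothetical toy collection: each "document" is a list of tokens.
docs = [
    ["tropical", "fish", "aquarium", "water"],
    ["tropical", "storm", "ocean"],
    ["fish", "aquarium", "goldfish"],
    ["tank", "armor", "military"],
]

n = defaultdict(int)      # n[a] = number of documents containing term a
n_ab = defaultdict(int)   # n_ab[(a, b)] = number of documents containing both a and b

for doc in docs:
    terms = set(doc)
    for t in terms:
        n[t] += 1
    for a, b in combinations(sorted(terms), 2):
        n_ab[(a, b)] += 1

def dice(a, b):
    """Dice's coefficient: 2 * n_ab / (n_a + n_b)."""
    pair = tuple(sorted((a, b)))
    return 2 * n_ab[pair] / (n[a] + n[b])

def mim(a, b):
    """Rank-equivalent mutual information measure: n_ab / (n_a * n_b)."""
    pair = tuple(sorted((a, b)))
    return n_ab[pair] / (n[a] * n[b])

print(dice("tropical", "fish"), mim("tropical", "fish"))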


Term Association Measures

The mutual information measure (MIM) favors low-frequency terms

The expected mutual information measure (EMIM) addresses this problem by weighting MIM by P(a, b):

EMIM(a, b) = P(a, b) · log ( N · nab / (na · nb) )  =rank  nab · log ( N · nab / (na · nb) )

Strictly, this is only the one part of the full EMIM that focuses on the case where both words occur

EMIM, however, favors high-frequency terms


Term Association Measures

Pearson’s chi-squared (χ2) measure

Compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent, and normalizes this comparison by the expected number:

χ2(a, b) = ( nab − N · (na/N) · (nb/N) )^2 / ( N · (na/N) · (nb/N) )  =rank  ( nab − na · nb / N )^2 / ( na · nb )

where N · (na/N) · (nb/N) is the expected number of co-occurrences if the two words occur independently

As with EMIM, this is a limited form of the full chi-squared test, focused on the co-occurrence of the two words

Like MIM, χ2 favors low-frequency terms (see the sketch below)
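A minimal sketch, assuming the raw counts na, nb, nab and the collection size N are already available (the counts in the example call are hypothetical), of the rank-equivalent forms of EMIM and χ2:

import math

def emim(n_a, n_b, n_ab, N):
    """Rank-equivalent EMIM: n_ab * log(N * n_ab / (n_a * n_b))."""
    if n_ab == 0:
        return 0.0
    return n_ab * math.log(N * n_ab / (n_a * n_b))

def chi_squared(n_a, n_b, n_ab, N):
    """Rank-equivalent chi-squared: (n_ab - n_a * n_b / N)**2 / (n_a * n_b)."""
    expected = n_a * n_b / N
    return (n_ab - expected) ** 2 / (n_a * n_b)

# Hypothetical counts: "tropical" in 120 docs, "fish" in 900 docs,
# both in 100 docs, out of N = 100,000 documents.
print(emim(120, 900, 100, 100_000))
print(chi_squared(120, 900, 100, 100_000))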


Association Measure Summary

Measure                               Rank-equivalent formula
Mutual information (MIM)              nab / (na · nb)
Expected mutual information (EMIM)    nab · log( N · nab / (na · nb) )
Chi-squared (χ2)                      ( nab − na · nb / N )^2 / ( na · nb )
Dice’s coefficient                    nab / ( na + nb )


Association Measure Example

Most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.

MIM and χ2 give an identical ranking and favor low-frequency words

EMIM and Dice’s coefficient give more general words than MIM and χ2


Association Measure Example

Most strongly associated words for “fish”, a high-frequency term, in a collection of TREC news stories.

MIM and χ2 give similar top-ranked words


Association Measure Example

Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.

MIM and χ2 still favor low-frequency terms

Most stable and reliable results regardless of the window size


Association Measures

The words associated with the individual terms are of little use for expanding the query “tropical fish”

Expansion based on the whole query takes context into account

e.g., using Dice’s coefficient with the term “tropical fish” gives the following highly associated words:

goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet

Computing associations for every possible query is impractical, so other approaches are used to achieve this effect (a sketch of whole-query association follows)
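A minimal sketch, using a hypothetical toy collection, of whole-query association: a “query occurrence” is taken to be a document that contains all of the query’s terms, and Dice’s coefficient is computed between that event and each candidate word:

docs = [
    {"tropical", "fish", "aquarium", "goldfish", "pet"},
    {"tropical", "fish", "coral", "exotic"},
    {"tropical", "storm", "rain"},
    {"fish", "market", "salmon"},
]

def dice_with_query(query_terms, candidate, docs):
    """Dice between 'all query terms present' and 'candidate present'."""
    n_q = sum(1 for d in docs if query_terms <= d)                      # docs containing the whole query
    n_c = sum(1 for d in docs if candidate in d)                        # docs containing the candidate
    n_qc = sum(1 for d in docs if query_terms <= d and candidate in d)  # docs containing both
    return 2 * n_qc / (n_q + n_c) if (n_q + n_c) > 0 else 0.0

query = {"tropical", "fish"}
candidates = ["aquarium", "coral", "storm", "salmon"]
ranked = sorted(candidates, key=lambda w: dice_with_query(query, w, docs), reverse=True)
print(ranked)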


Other Approaches

Pseudo-relevance feedback

• Expansion terms are based on the top-ranked documents retrieved for the initial query

Context vectors

• Represent each word by the words that co-occur with it

• e.g., the top 35 most strongly associated words for “aquarium” (using Dice’s coefficient)

• Rank words for a query by ranking their context vectors (see the sketch below)

• Challenges (computational cost and accuracy) arise from the huge size and variable quality of the collections
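A minimal sketch, with hypothetical precomputed Dice scores, of ranking candidate expansion words by the cosine similarity between each word’s context vector and the combined context vector of the query terms:

import math
from collections import defaultdict

# Hypothetical precomputed scores: context_vec[word][co-occurring word] = Dice score.
context_vec = {
    "aquarium": {"fish": 0.6, "tank": 0.5, "water": 0.4, "coral": 0.3},
    "tropical": {"fish": 0.5, "island": 0.4, "storm": 0.3, "coral": 0.2},
    "fish":     {"aquarium": 0.6, "tropical": 0.5, "water": 0.4, "salmon": 0.3},
    "armor":    {"tank": 0.7, "military": 0.5, "steel": 0.4},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def query_vector(query_terms):
    """Combine the context vectors of the query terms by summing them."""
    combined = defaultdict(float)
    for t in query_terms:
        for w, s in context_vec.get(t, {}).items():
            combined[w] += s
    return combined

q_vec = query_vector(["tropical", "fish"])
candidates = ["aquarium", "armor"]
ranked = sorted(candidates, key=lambda w: cosine(context_vec[w], q_vec), reverse=True)
print(ranked)   # "aquarium" ranks above "armor" for the query "tropical fish"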


Other Approaches

Query logs

• The best source of information about queries and related terms

• Consist of short pieces of text plus click data

• e.g., the most frequent words in queries containing “tropical fish” from an MSN log: stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies

• Query suggestion is then based on finding similar queries, grouped using click data (see the sketch below)
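A minimal sketch, with a hypothetical click log, of grouping queries by click data: queries whose clicks land on many of the same URLs are treated as similar and offered as suggestions:

clicks = {
    "tropical fish":          {"fishstore.example.com", "aquariumcare.example.org"},
    "freshwater aquarium":    {"aquariumcare.example.org", "fishstore.example.com"},
    "tropical fish supplies": {"fishstore.example.com", "petsupply.example.net"},
    "tropical storm":         {"weather.example.com"},
}

def jaccard(s1, s2):
    """|intersection| / |union| of two sets of clicked URLs."""
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def suggest(query, clicks, k=2):
    """Suggest the k queries whose click sets are most similar to the query's."""
    target = clicks[query]
    others = [q for q in clicks if q != query]
    return sorted(others, key=lambda q: jaccard(target, clicks[q]), reverse=True)[:k]

print(suggest("tropical fish", clicks))
# likely: ['freshwater aquarium', 'tropical fish supplies']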


Query Expansion

Search engines suggest expanded/alternative queries in response to a query Q

One approach uses some form of thesaurus to perform global analysis

• For each term t in Q, Q is expanded with synonyms and related words of t from the thesaurus (see the sketch below)
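A minimal sketch, with a small hypothetical thesaurus, of this expansion step: each query term is kept and augmented with its synonyms and related words:

thesaurus = {
    "car":  ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(query, thesaurus):
    """Return the original terms plus the thesaurus entries for each term."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))   # terms not in the thesaurus are left unchanged
    return " ".join(expanded)

print(expand_query("fast car rental", thesaurus))
# -> "fast quick rapid car automobile vehicle rental"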


Query Expansion

Methods for building a thesaurus for query expansion:

1. A controlled vocabulary maintained by human editors, such as the Library of Congress Subject Headings (LCSH)

• e.g., the LCSH heading for “American Revolutionary War” is “United States -- History -- Revolution, 1775-1783”

2. An automatically derived thesaurus, constructed using word co-occurrence statistics over a collection of documents

3. Query reformulations based on query log mining, i.e., exploiting the manual query reformulations of other users to make suggestions to a user

Thesaurus-based query expansion increases recall without requiring any user input


Query Expansion

Automatic thesaurus generation using word co-occurrence

A simple approach is based on term-term similarities

• Start with a term-document matrix A, where each cell At,d holds a weighted count wt,d for term t and document d

• Calculate C = AA^T (A multiplied by its transpose), in which Cu,v is a similarity score between terms u and v; the larger the value, the more similar the terms (see the sketch below)

• A thesaurus derived this way contains both good and bad suggestions
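A minimal sketch, using a hypothetical toy term-document matrix, of deriving the term-term similarity matrix C = AA^T and reading off thesaurus-style suggestions for a term:

import numpy as np

terms = ["tropical", "fish", "aquarium", "storm"]
# Rows = terms, columns = documents; entries are weighted counts (e.g., tf or tf-idf).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # tropical
    [1.0, 2.0, 2.0, 0.0],   # fish
    [0.0, 1.0, 2.0, 0.0],   # aquarium
    [1.0, 0.0, 0.0, 2.0],   # storm
])

C = A @ A.T   # C[u, v] is the similarity score between terms u and v

def most_similar(term, k=2):
    """Return the k terms with the highest similarity to `term` (excluding itself)."""
    i = terms.index(term)
    order = np.argsort(-C[i])
    return [terms[j] for j in order if j != i][:k]

print(most_similar("fish"))   # e.g., ['aquarium', 'tropical']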


Query Expansion

The quality of the term associations is typically a problem in an automatically generated thesaurus

• Term ambiguity easily introduces irrelevant but statistically correlated terms, e.g., “Apple” may be expanded to “Apple red fruit computer”

• Expansions therefore suffer from false positives (FP) and false negatives (FN)

Manually producing and updating a thesaurus is costly

Query expansion often increases recall, but may also significantly decrease precision, especially when the query contains ambiguous terms

• e.g., expanding “interest rate” to “interest rate fascinate evaluate” is unlikely to be useful