IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems

Boolean or Statistical?

Most web search engines default to statistical retrieval and offer Boolean for advanced searching

Most proprietary online systems default to Boolean and offer statistical retrieval as an alternative

Statistical search engine vs. relevance ranking of Boolean results

Web Search Engines

Databases generated by robotic (non-human) programs: spiders, wanderers, web walkers, agents

Full-text indexing of website contents

Supports advanced, complex search strategies

3 Parts of a Web Search Engine

1. Spider or web-crawler reads webpage, follows links

2. Index catalogs webpages read by spider

3. Search engine software matches queries and lists the most relevant sites first
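
The index and search parts above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the spider step is replaced by a hypothetical dictionary of already-fetched pages (`PAGES`), and "most relevant" is assumed to mean "contains the most query words".

```python
from collections import defaultdict

# Hypothetical pages standing in for what a spider would have fetched (part 1).
PAGES = {
    "a.example/ice": "ice and snow on the sidewalk",
    "b.example/law": "assumption of risk in snow cases",
    "c.example/sun": "sunny beach holidays",
}

def build_index(pages):
    """Part 2: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Part 3: rank pages by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))

INDEX = build_index(PAGES)
RESULTS = search(INDEX, "snow risk")
```

Here `RESULTS` lists `b.example/law` first because it contains both query words, then `a.example/ice`, which contains only one.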

3 Parts of an Online System

1) Database building software (dataware): follows rules with known fields

2) Index/dictionary file: list of all words and sometimes phrases in the indexed fields

3) Search engine software: matches queries; Boolean or statistical; LIFO or relevance-ranked output

Boolean Operators

AND: limits search, decreases hits, increases precision

OR: expands search, increases hits, decreases precision

NOT: limits search; seldom used, too strong
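
One way to see why AND shrinks a result set and OR grows it is to treat each term as the set of documents containing it. The posting sets below are made up for illustration.

```python
# Hypothetical postings: term -> set of document IDs containing it.
POSTINGS = {
    "ice": {1, 2, 3},
    "snow": {2, 3, 4},
    "risk": {3, 5},
}

ice, snow, risk = POSTINGS["ice"], POSTINGS["snow"], POSTINGS["risk"]

and_hits = ice & snow   # AND: intersection -> {2, 3}
or_hits = ice | snow    # OR: union -> {1, 2, 3, 4}
not_hits = ice - snow   # ice NOT snow: difference -> {1}
```

NOT drops documents 2 and 3 even though both mention ice, which is one reason it is considered too strong for routine use.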

Proximity Operators

Adj, (N)ear, (W)ith

Limit a search, increase precision
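
A proximity test can be sketched as a check on word positions. The exact meaning of Adj, Near, and With varies by system; the helper below is hypothetical and assumes that "near" means within a given number of words in either order, and that an ordered match with a gap of 1 behaves like adjacency.

```python
def within(text, first, second, max_gap, ordered=False):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    for a in pos_a:
        for b in pos_b:
            distance = b - a
            if ordered and 0 < distance <= max_gap:
                return True   # `second` follows `first` closely enough
            if not ordered and 0 < abs(distance) <= max_gap:
                return True   # close enough in either direction
    return False

DOC = "the plaintiff assumed the risk of falling on ice"
near_match = within(DOC, "assumed", "risk", 5)               # two words apart
adjacent_match = within(DOC, "assumed", "the", 1, ordered=True)
```

Because both words must appear close together, a proximity match is stricter than a plain AND, which is why it raises precision.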

Command Interface Boolean Searching (Westlaw)

Find information about the assumption of risk involving people who fall after slipping in wintry conditions.

assum! /5 risk /p (ic* or snow****) /p (slip! or fell or fall***)
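
The truncation symbols in the query above can be approximated with regular expressions. This sketch rests on two assumptions that should be checked against the Westlaw documentation: that `!` expands a root to any ending, and that each trailing `*` allows up to one additional character. The function names are hypothetical, and only symbols at the end of a term are handled.

```python
import re

def term_to_regex(term):
    """Translate one truncated term into a compiled regular expression."""
    if term.endswith("!"):
        # Root expander: the root followed by any number of word characters.
        pattern = re.escape(term[:-1]) + r"\w*"
    else:
        root = term.rstrip("*")
        extra = len(term) - len(root)
        # Trailing asterisks: up to `extra` additional word characters.
        pattern = re.escape(root) + r"\w{0,%d}" % extra
    return re.compile(pattern, re.IGNORECASE)

def matches(term, word):
    """True if the whole word is covered by the truncated term."""
    return term_to_regex(term).fullmatch(word) is not None
```

Under these assumptions `matches("fall***", "fallen")` is true, while `matches("ic*", "icicle")` is false because only one extra character is allowed after "ic".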

Natural Language and Relevance Ranking (WIN)

I need information on assumption of risk involving a person who has fallen on ice or snow.

Non-Boolean Retrieval Systems

Statistical (associative, probabilistic, or relevance systems)

Linguistic (semantic)

Statistical Retrieval Systems

Incorporate relevance ranking

May incorporate relevance feedback

May have natural language interface

Used by almost all web search engines

Algorithm

Latin algorismus, after al-Khwarizmi, Arabian mathematician (AD 825)

Step-by-step procedure for solving mathematical problems (Merriam-Webster, http://www.m-w.com/)

Statistical search engines use weighting algorithms to compute relevance

Statistical Search Engines

Weighting algorithms are proprietary

Search engines differ in how they assign weights and compute relevance ranking

Search results differ

studies found only about 40% overlap

Statistical Web Retrieval Factors

Popularity: number of other sites that link to a site; authoritative sites given heavier weight (Google)

Meta-tags may boost ranking (Inktomi/Overture)

Direct Hit may boost ranking (HotBot)
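
The link-popularity factor can be illustrated with a toy link graph. Real engines keep their weighting proprietary, so this is only a sketch under a stated assumption: a link counts for more when the linking page is itself linked to by others. The graph and function names are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "e": ["d"],
}

def inlink_counts(links):
    """Plain popularity: how many distinct pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            counts[target] += 1
    return counts

def weighted_popularity(links):
    """Each in-link is worth 1 plus the linking page's own in-link count."""
    base = inlink_counts(links)
    scores = Counter()
    for source, targets in links.items():
        for target in set(targets):
            scores[target] += 1 + base[source]
    return scores

PLAIN = inlink_counts(LINKS)
WEIGHTED = weighted_popularity(LINKS)
```

Pages `c` and `d` each have two in-links, but `d` scores higher in `WEIGHTED` because one of its links comes from `c`, which is itself well linked.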

Linguistic Retrieval System

Natural Language & Relevance Ranking

WIN (Westlaw Is Natural) has some linguistic elements

I need information on assumption of risk involving a person who has fallen on ice or snow.

WIN Steps

1. Enter query in plain English

2. System removes stop phrases

3. Matches legal phrases from thesaurus, adjusts weighting

4. Removes stop words

WIN Steps (cont.)

5. Stemming

6. Searches database indexes in OR relationship

7. Statistical comparison applied

8. Results placed in ranked order
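
The WIN steps can be walked through with a small pipeline. This is a rough sketch, not Westlaw's implementation: the stop lists are invented, the stemmer is a crude suffix stripper, the thesaurus step (step 3) is skipped, and the "statistical comparison" is assumed to be a simple count of shared stems.

```python
import re

STOP_PHRASES = ["i need information on"]                        # hypothetical
STOP_WORDS = {"a", "of", "on", "or", "who", "has", "the", "in"}  # hypothetical
SUFFIXES = ("ing", "en", "ed", "s")            # crude stand-in for a stemmer

def stem(word):
    """Step 5: strip one common suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def query_terms(query):
    """Steps 1, 2, 4, 5: plain English in, stemmed content words out."""
    text = query.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    words = re.findall(r"[a-z]+", text)
    return [stem(w) for w in words if w not in STOP_WORDS]

def rank(query, docs):
    """Steps 6 to 8: OR the terms together, score by overlap, sort."""
    terms = set(query_terms(query))
    scored = []
    for doc_id, text in docs.items():
        doc_stems = {stem(w) for w in re.findall(r"[a-z]+", text.lower())}
        overlap = len(terms & doc_stems)
        if overlap:               # OR relationship: any shared term qualifies
            scored.append((overlap, doc_id))
    return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]

QUERY = ("I need information on assumption of risk involving "
         "a person who has fallen on ice or snow.")
DOCS = {
    "d1": "The skier assumed the risk of falling on ice.",
    "d2": "Snow removal contracts for persons and businesses.",
    "d3": "Tax law for corporations.",
}
RANKED = rank(QUERY, DOCS)
```

Stemming is what lets "fallen" in the query match "falling" in `d1`. Note that the crude stemmer does not reduce "assumption" and "assumed" to the same stem, which a real stemmer or the thesaurus step would be expected to handle.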

Factors in Determining Relevance

Proximity of query words to each other

Position of query words: keywords in the title rank higher, as do keywords in a headline or near the top

Relative length of document (“normalization”)

Stemming

Factors in Determining Relevance (cont.)

Ignore very frequent terms

Inverse term frequency

Relevance feedback

Stop words

Query expansion/thesaurus
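
Two of the factors above, document-length normalization and down-weighting of frequent terms, can be combined in a short scoring sketch. It assumes the slide's "inverse term frequency" corresponds to the weighting usually called inverse document frequency, and it uses a textbook-style formula rather than any engine's actual algorithm. The documents are invented.

```python
import math
from collections import Counter

DOCS = {
    "d1": "ice ice snow risk",
    "d2": "snow snow snow snow plow truck route map",
    "d3": "risk law",
}

def idf(term, docs):
    """Rarer terms get a larger weight; a term in every document gets 0."""
    containing = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / containing) if containing else 0.0

def score(query, doc_text, docs):
    """Sum of (normalized term frequency x rarity weight) over query terms."""
    words = doc_text.split()
    counts = Counter(words)
    total = 0.0
    for term in set(query.split()):
        tf = counts[term] / len(words)   # normalize by document length
        total += tf * idf(term, docs)
    return total

def rank(query, docs):
    """Return matching documents, highest score first."""
    scores = {d: score(query, t, docs) for d, t in docs.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))

RANKED = rank("ice risk", DOCS)
```

In this example "ice" appears in only one document, so it carries more weight than "risk", and `d1` ranks ahead of `d3`.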

Features Users Can Control

Designating “bound phrases”

Flagging terms that must be present*

Specifying truncat?

Indicating (synonym groups)

Synonym dictionaries
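
The user-controlled features above can be pictured as markers that a query parser recognizes. The syntax here is hypothetical and simply follows the symbols used in the list: straight double quotes for bound phrases, a trailing `*` for required terms, a trailing `?` for truncation, and parentheses for synonym groups. Actual symbols differ from system to system.

```python
import re

# One alternative per feature: "bound phrase", (synonym group), or a bare word.
TOKEN = re.compile(r'"([^"]+)"|\(([^)]+)\)|(\S+)')

def parse_query(query):
    """Sort the pieces of a query into the features the user has marked."""
    parsed = {"phrases": [], "synonyms": [], "required": [],
              "truncated": [], "plain": []}
    for phrase, group, word in TOKEN.findall(query):
        if phrase:
            parsed["phrases"].append(phrase)
        elif group:
            parsed["synonyms"].append(group.split())
        elif word.endswith("*"):
            parsed["required"].append(word[:-1])
        elif word.endswith("?"):
            parsed["truncated"].append(word[:-1])
        else:
            parsed["plain"].append(word)
    return parsed

PARSED = parse_query('"assumption of risk" (ice snow) slip? fall* winter')
```

The parser only classifies the pieces; a search engine would still have to decide how each feature changes matching and weighting.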

Web Sites that list search engines and features:

www.pandia.com

www.searchenginewatch.com

http://notess.com
