How Search Engines Work: A Technology Overview
Avi Rappoport, Search Tools Consulting
www.searchtools.com, consult1@searchtools.com
UC Berkeley SIMS class 202, September 16, 2004




Purpose of Search Engines

Helping people find what they’re looking for
• Starts with an “information need”
• Convert to a query
• Gets results

In the materials available
• Web pages
• Other formats
• Deep Web


Search is Not a Panacea

Search can’t find what’s not there
• The content is hugely important

Information Architecture is vital
Usable sites have good navigation and structure


Search Looks Simple


But It's Not

Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index

Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance

Display results
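
The three stages on this slide (index ahead of time, match and sort, display) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular engine's design: the sample documents and the names `tokenize`, `build_index`, and `search` are made up for the example, and "relevance" here is just a count of matched query words.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Index ahead of time: map each word to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Match query terms against the index, then sort by number of terms matched."""
    hits = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Most matched terms first; ties broken by doc ID for a stable order
    return sorted(hits, key=lambda d: (-hits[d], d))

# Hypothetical sample content
docs = {
    "a.html": "Search engines index web pages",
    "b.html": "Web pages link to other pages",
    "c.html": "Cooking with fresh herbs",
}
index = build_index(docs)
results = search(index, "web search")
print(results)  # display results
```

The query is tokenized with the same function as the documents, so query words and index words line up.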


Search Processing


Search is Mostly Invisible

Like an iceberg, 2/3 below water

[Diagram labels: user interface, search functionality, content]


Text Search vs. Database Query

Text search works for structured content

Keyword search vs. SQL queries
Approximate vs. exact match
Multiple sources of content
Response time and database resources
Relevance ranking, very important
Works in the real world (e.g. eBay)


Search is Only as Good as the Content

Users blame the search engine
• Even when the content is unavailable

Understand the scope of site or intranet
• Kinds of information
• Divided sites: products / corporate info
• Dates
• Languages
• Sources and data silos: CMSs, databases...
• Update processes


Making a Searchable Index

Store text to search it later
Many ways to gather text
• Crawl (spider) via HTTP
• Read files on file servers
• Access databases (HTTP or API)
• Data silos via local APIs
• Applications, CMSs, via Web Services

Security and Access Control
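
The "crawl (spider)" approach can be sketched as a breadth-first walk over links. To keep the example self-contained, the three-page site below lives in a dictionary instead of being fetched over HTTP; the URLs, the `SITE` contents, and the `crawl` function are assumptions for illustration. A production crawler would also need to respect robots.txt, throttle requests, and handle errors and access control.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl: fetch a page, queue its unseen links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not available: skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Hypothetical site held in memory instead of fetched over HTTP
SITE = {
    "http://example.test/": '<a href="/about.html">About</a> <a href="/faq.html">FAQ</a>',
    "http://example.test/about.html": '<a href="/">Home</a>',
    "http://example.test/faq.html": '<a href="/missing.html">Broken</a>',
}
pages = crawl("http://example.test/", SITE.get)
print(sorted(pages))
```

Passing the fetch function in as a parameter means the same loop could sit on top of real HTTP requests, a file server, or a database reader.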


Robot Indexing Diagram

Source: James Ghaphery, VCU


What the Index Needs

Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type

Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
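
One way to picture what the index stores per item is a simple record type. The field names below are hypothetical, chosen to mirror the bullets on this slide; real engines each define their own schema.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    # Basic information for the document or record
    doc_id: str          # file name, URL, or record ID
    title: str
    size_bytes: int
    modified: str        # date, kept as an ISO string for simplicity
    mime_type: str
    # Full text of the item
    full_text: str
    # More metadata: product name, category, other ranking/display attributes
    metadata: dict = field(default_factory=dict)

# Hypothetical example record
record = IndexRecord(
    doc_id="http://example.test/products/widget.html",
    title="Widget 3000",
    size_bytes=18432,
    modified="2004-09-01",
    mime_type="text/html",
    full_text="The Widget 3000 is our most popular widget.",
    metadata={"product_name": "Widget 3000", "category": "Hardware"},
)
print(record.title, record.metadata["category"])
```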


Simple Index Diagram


More Complex Index Processing


Index Issues

Stopwords
Stemming
Metadata
• Explicit (tags)
• Implicit (context)

Semantics
• CMS and Database fields
• XML tags and attributes
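
Stopword removal and stemming can be illustrated with a deliberately crude sketch. The stopword list and the suffix rules below are assumptions for the example; real indexers use much larger lists and proper stemmers (the Porter stemmer is a common choice), and many engines make both steps optional.

```python
import re

# Illustrative stopword list only; real lists are longer and language-specific
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def naive_stem(word):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

terms = index_terms("The indexing of searches and the search engines")
print(terms)
```

With this sketch "searches" and "search" collapse to the same term, which is the point of stemming, while "engines" becomes the non-word "engin", which shows why stemmed forms are kept inside the index rather than shown to users.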


Search Query Processing

What happens after you click the search button, and before retrieval starts.

Usually in this order
• Handle character set, maybe language
• Look for operators and organize the query
• Look for field names or metadata
• Extract words (just like the indexer)
• Deal with letter casing
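
The steps above can be sketched as a small query parser. The query syntax here (`+` for required, `-` for excluded, `field:` prefixes, double quotes for phrases) is an assumption for illustration; every engine defines its own operators, and `parse_query` is a made-up name.

```python
import re
import unicodedata

# Optional +/- operator, optional field prefix, then a quoted phrase or a bare word
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query(raw):
    # 1. Handle character set: normalize Unicode to a consistent form
    text = unicodedata.normalize("NFKC", raw)
    clauses = []
    # 2-3. Look for operators and field names
    for op, fld, phrase, word in TOKEN.findall(text):
        body = phrase if phrase else word
        # 4. Extract words (the indexer should use the same rule)
        words = re.findall(r"\w+", body)
        if not words:
            continue
        clauses.append({
            "required": op == "+",
            "excluded": op == "-",
            "field": fld.lower() or None,
            "phrase": bool(phrase),
            # 5. Deal with letter casing
            "words": [w.lower() for w in words],
        })
    return clauses

clauses = parse_query('+title:"Search Engines" -spam Berkeley')
for clause in clauses:
    print(clause)
```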


Search and Retrieval

Retrieval: find files with query terms
Not the same as relevance ranking

Recall: find all relevant items

Precision: find only relevant items

Increasing one usually decreases the other
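
A small worked example makes the recall/precision trade-off concrete. The document IDs and the two result sets below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: four items are truly relevant to the query
relevant = {"d1", "d2", "d3", "d4"}

narrow = ["d1", "d2"]                          # strict matching
broad = ["d1", "d2", "d3", "d5", "d6", "d7"]   # loose matching

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

Loosening the match finds one more relevant item (recall rises from 0.5 to 0.75) but drags in three irrelevant ones (precision falls from 1.0 to 0.5).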


Retrieval = Matching

Single-word queries
• Find items containing that word

Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in order

Boolean and complex queries
• Use algorithm to combine lists
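
Combining lists maps naturally onto set operations. The sketch below assumes a positional index (each word maps to the documents and word positions where it occurs) so that phrases can be checked; the sample documents and function names are made up, and "in order" is read here as "adjacent and in order".

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, words):
    """Any: union of the per-word document lists."""
    result = set()
    for w in words:
        result |= set(index.get(w, {}))
    return result

def match_all(index, words):
    """All: intersection of the per-word document lists."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Phrase: documents where the words appear consecutively, in order."""
    result = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            result.add(doc_id)
    return result

# Hypothetical sample documents
docs = {1: "search engines work", 2: "engines that search", 3: "work on engines"}
index = build_positional_index(docs)
query = ["search", "engines"]
print(match_any(index, query), match_all(index, query), match_phrase(index, query))
```

Boolean queries extend the same idea: AND is intersection, OR is union, and NOT is set difference.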


Why Searches Fail

Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure


LII.org No-Matches Page


Relevance Ranking

Theory: sort the matching items, so the most relevant ones appear first

Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?

First 10 results are the most important
The more transparent, the better


Relevance Processing

Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Ted Nelson - not only hypertext, but the links
• Google PageRank
  • Incoming links
  • Authority of linkers

Taxonomies and external metadata
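
The two PageRank ingredients named above, incoming links and the authority of the linking pages, can be illustrated with a simplified iteration in the spirit of the published PageRank idea. This is a teaching sketch, not Google's production ranking; the link graph, the damping value of 0.85, and the fixed iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's rank along its outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its authority evenly among the pages it links to
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page:10s} {score:.3f}")
```

"home" ends up on top because three pages link to it, while "orphan", which nothing links to, keeps only the small baseline score.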


TF-IDF Ranking Algorithm

Term frequency in the item
Inverse document frequency of term
• Rare words are likely to be more important

w_ij = weight of Term T_j in Document D_i
tf_ij = frequency of Term T_j in Document D_i
N = number of Documents in collection
n = number of Documents where term T_j occurs at least once

From Salton 1989
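
The slide's formula did not survive in this transcript. A common formulation that matches the definitions above is w_ij = tf_ij × log(N / n), and the sketch below assumes that form. The sample documents and the whitespace tokenizer are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using w_ij = tf_ij * log(N / n)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    N = len(tokenized)                      # number of documents in collection
    doc_freq = Counter()                    # n: documents containing each term
    for words in tokenized.values():
        doc_freq.update(set(words))
    weights = {}
    for d, words in tokenized.items():
        tf = Counter(words)                 # tf_ij: term frequency in this doc
        weights[d] = {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}
    return weights

# Hypothetical three-document collection
docs = {"d1": "bank loan bank", "d2": "river bank", "d3": "loan rates"}
weights = tf_idf(docs)
for doc_id, terms in weights.items():
    print(doc_id, {t: round(w, 3) for t, w in terms.items()})
```

In document d2, "river" outweighs "bank" because it occurs in only one document, which is the "rare words are likely to be more important" effect.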


Other Algorithms

Vector space
Probabilistic (binary independence)
Fuzzy set theory
Bayesian statistical analysis
Latent semantic indexing
Neural networks
Machine learning
All require sophisticated queries
See MIR, chapter 2
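
As a taste of the first item, the vector space model treats the query and each document as vectors of term weights and ranks by the angle between them. The sketch below uses raw word counts as weights for simplicity, which is an assumption; practical systems typically use TF-IDF weights instead.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("web search", "web pages"))      # one of two words shared
print(cosine_similarity("web search", "cooking herbs"))  # nothing shared
```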


Relevance Heuristics

Heuristics are rules of thumb
• Not algorithms, not math

Search Relevance Ranking Heuristics
• Documents containing all search words
• Search words as a phrase
• Matches in title tag
• Matches in other metadata

Based on real-world user behavior
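
Heuristics like these are often implemented as weighted bonuses added to a score. The weights below are arbitrary placeholders, not values from any real engine, and the function and sample documents are made up for illustration; tuning such weights against real queries is the "adjust heuristic weighting" work.

```python
# Hypothetical weights; real engines tune these against actual queries
DEFAULT_WEIGHTS = {"all_words": 3.0, "phrase": 4.0, "title_word": 2.0}

def heuristic_score(query, title, body, weights=DEFAULT_WEIGHTS):
    q_words = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    all_words = set(title_words) | set(body_words)
    score = 0.0
    # Bonus: document contains all search words
    if q_words and all(w in all_words for w in q_words):
        score += weights["all_words"]
    # Bonus: search words appear as a phrase in the body
    phrase = " " + " ".join(q_words) + " "
    if len(q_words) > 1 and phrase in " " + " ".join(body_words) + " ":
        score += weights["phrase"]
    # Bonus: each search word found in the title
    score += weights["title_word"] * sum(1 for w in q_words if w in title_words)
    return score

score_a = heuristic_score("search engines", "Search Engines Explained",
                          "how search engines rank pages")
score_b = heuristic_score("search engines", "Engine Repair",
                          "search for parts for engines")
print(score_a, score_b)
```

Both documents contain both query words, but only the first has them as a phrase and in the title, so it scores far higher.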


Search Results Interface

What users see after they click the Search button

The most visible part of search
Elements of the results page
• Page layout and navigation
• Results header
• List of results items
• Results footer


Many Experiments in Interface


Back to Simplicity


Search Suggestions (aka Best Bets)

Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates

Recommend good starting pages
• Product information, FAQs, etc.

Requires human resources
• That means money and time

More static than algorithmic search
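
At its simplest, a best-bets feature is a hand-maintained lookup table consulted before the algorithmic results are shown. The queries, page titles, paths, and synonym entries below are hypothetical.

```python
# Hand-curated by site staff; entries here are invented examples
BEST_BETS = {
    "jobs": ("Careers", "/careers/"),
    "returns": ("Return policy FAQ", "/help/returns.html"),
}
# Alternate wordings that should trigger the same suggestion
SYNONYMS = {"employment": "jobs", "refund": "returns"}

def best_bet(query):
    """Return (title, url) for a curated query, or None if there is no match."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    key = SYNONYMS.get(key, key)
    return BEST_BETS.get(key)

print(best_bet("  Jobs "))
print(best_bet("refund"))
print(best_bet("quarterly report"))
```

Because the table only changes when someone edits it, it stays more static than the algorithmic results, which is the trade-off noted above.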


MSU Keywords


Siemens Results


Cooks.com Results


Salon.com Results


Faceted Metadata Search & Browse

Leverage content structure
• database fields (e.g. cruise amenities)
• document metadata (news article bylines)

Provide both search and browse
• Support information foraging
• Integrate navigation with results
• Not just subject taxonomies
• Display only fruitful paths, no dead ends

Supported by academic research
• Marti Hearst, UCB SIMS, flamenco.berkeley.edu
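
The "no dead ends" idea can be sketched by computing facet counts from the current result set only, so every link offered leads to at least one item. The cruise records and field names below are invented, loosely following the slide's cruise-amenities example.

```python
from collections import Counter

def facet_counts(items, facet_fields):
    """Count the values of each facet among the items currently shown."""
    counts = {f: Counter() for f in facet_fields}
    for item in items:
        for f in facet_fields:
            value = item.get(f)
            if value is not None:
                counts[f][value] += 1
    return counts

def refine(items, **selected):
    """Keep only items matching every selected facet value."""
    return [it for it in items
            if all(it.get(f) == v for f, v in selected.items())]

# Hypothetical structured records
CRUISES = [
    {"name": "Alaska Explorer", "region": "Alaska", "amenity": "spa"},
    {"name": "Glacier Bay", "region": "Alaska", "amenity": "pool"},
    {"name": "Island Hopper", "region": "Caribbean", "amenity": "pool"},
]

alaska = refine(CRUISES, region="Alaska")
counts = facet_counts(alaska, ["region", "amenity"])
print(dict(counts["amenity"]))
```

After the user picks "Alaska", "Caribbean" no longer appears among the region choices, since selecting it would return nothing.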


Faceted Search: Information


Faceted Search: Online Catalog


Search Metrics and Analytics

Metrics
• Number of searches
• Number of no-matches searches
• Traffic from search to high-value pages
• Relate search changes to other metrics

Search Log Analysis
• Top 5% searches: phrases and words
• Top no-matches searches
  • Use as market research
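
The basic metrics can be pulled from a search log with a couple of counters. The log format assumed here, a list of (query, number of results) pairs, and the sample entries are hypothetical; real logs vary by engine.

```python
from collections import Counter

def analyze_log(entries, top_n=2):
    """Summarize (query, result_count) log entries."""
    queries = Counter()
    no_matches = Counter()
    for query, result_count in entries:
        key = " ".join(query.lower().split())  # fold case and extra spaces
        queries[key] += 1
        if result_count == 0:
            no_matches[key] += 1
    return {
        "total_searches": sum(queries.values()),
        "no_match_searches": sum(no_matches.values()),
        "top_queries": queries.most_common(top_n),
        "top_no_matches": no_matches.most_common(top_n),
    }

# Hypothetical log entries: (query as typed, number of results returned)
LOG = [
    ("jobs", 12), ("Jobs", 12), ("parking", 4), ("parking pass", 0),
    ("parking pass", 0), ("jobs", 12), ("xyzzy", 0),
]
report = analyze_log(LOG)
for name, value in report.items():
    print(name, value)
```

Here "parking pass" is both popular and unanswered, which is the kind of finding that suggests new content or a best-bets entry.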


Search Will Never Be Perfect

Search engines can’t read minds
• User queries are short and ambiguous

Some things will help
• Design a usable interface
• Show match words in context
• Keep index current and complete
• Adjust heuristic weighting
• Maintain suggestions and synonyms
• Consider faceted metadata search


Search Engines, sorta Rocket Science

Questions and discussion
Contact me
• consult1@searchtools.com
• www.searchtools.com

This presentation:
• www.searchtools.com/slides/sims/202-04/