Page 1: INFM 700: Session 9 Search (Part II) Search Engines in Information Architecture Paul Jacobs The iSchool University of Maryland Wednesday, Apr. 18, 2012.

INFM 700: Session 9

Search (Part II)
Search Engines in Information Architecture

Paul Jacobs
The iSchool
University of Maryland

Wednesday, Apr. 18, 2012

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today’s Topics

Very short recap
  Fundamentals of information retrieval
  Search engines in practice (web search and web sites)

Issues and tricks
  Stemming/word issues
  Query formulation/expansion/assistance
  Tagging/structuring
  Others

Deploying search: what we get to do, and how

Vector Space Model

Assumption: Documents that are “close together” in vector space “talk about” the same things

[Figure: document vectors d1 to d5 plotted in a space with term axes t1, t2, and t3; θ and φ label angles between vectors.]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
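
One common way to make “closeness” concrete is the cosine of the angle between two vectors. The slide does not name a specific measure, so cosine is an assumption here, as are the toy vectors and the function name. A minimal sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical weights over three terms (t1, t2, t3); not taken from the slides.
query = [1.0, 1.0, 0.0]
docs = {
    "d1": [2.0, 2.0, 0.0],  # same direction as the query
    "d2": [0.0, 0.0, 3.0],  # shares no terms with the query
    "d3": [1.0, 0.0, 1.0],  # shares one term with the query
}

# Rank document ids by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

Because cosine depends only on direction, d1 scores as a perfect match even though it is twice as long as the query vector.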

Term Weighting

Term weights consist of two components
  Local: how important is the term in this doc?
  Global: how important is the term in the collection?

Here’s the intuition:
  Terms that appear often in a document should get high weights
  Terms that appear in many documents should get low weights

How do we capture this mathematically?
  Term frequency (local)
  Inverse document frequency (global)

TF.IDF Term Weighting

$$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}$$

where
  $w_{i,j}$ = weight assigned to term i in document j
  $\mathrm{tf}_{i,j}$ = number of occurrences of term i in document j
  $N$ = number of documents in entire collection
  $n_i$ = number of documents with term i
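
As a rough illustration of the weighting on this slide, the sketch below computes tf multiplied by log(N / n) for a toy collection. The three documents and the function name are made up, and the natural logarithm is an assumption, since the slide does not specify a base.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / n)."""
    n_docs = len(docs)                      # N: documents in the collection
    doc_freq = Counter()                    # n: documents containing each term
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)                # tf: occurrences of each term in this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return weights

# Hypothetical three-document collection, already split into terms.
docs = {
    "d1": "search engine search index".split(),
    "d2": "search interface design".split(),
    "d3": "site navigation design".split(),
}
w = tfidf_weights(docs)
print(w["d1"])
```

In this toy data "search" occurs twice in d1 but also appears in two of the three documents, so it ends up with a lower weight than "engine", which occurs once but in only one document. A term that appears in every document gets weight 0, because log(1) = 0.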

Summary thus far…

Represent documents (and queries) as “bags of words” (terms)

Derive term weights based on frequency

Use weighted term vectors for each document, query

Compute a vector-based similarity score

Display sorted, ranked results
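
The steps above can be strung together in a few lines. This is only a sketch under stated assumptions: whitespace tokenization, tf-idf weights with a natural logarithm, cosine similarity as the score, and a made-up three-page collection. Production engines differ in many details.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Bag-of-words tf-idf vectors (as dicts) for every document, plus the idf table."""
    bags = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags.values() for term in bag)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = {d: {t: tf * idf[t] for t, tf in bag.items()} for d, bag in bags.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def search(query, vectors, idf):
    """Score every document against the query; return (doc_id, score) pairs, best first."""
    q_bag = Counter(tokenize(query))
    q_vec = {t: tf * idf[t] for t, tf in q_bag.items() if t in idf}
    scored = [(d, cosine(q_vec, vec)) for d, vec in vectors.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

# Hypothetical site content; ids and text are illustrative only.
docs = {
    "faq": "how to reset your password",
    "hours": "library opening hours and holiday hours",
    "contact": "how to contact the library",
}
vectors, idf = build_index(docs)
results = search("library hours", vectors, idf)
print(results)
```

The "hours" page ranks first because it matches both query terms and "hours" is rare in the collection; "contact" matches only the more common term "library"; "faq" shares no terms and is dropped.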

Issues and Tricks

What’s a word/term?
  We can ignore words (“stop words”), combine (phrases), split up (“stem”) words
  Other special treatment (e.g., names, categories)

Query formulation/suggestion

Type of information need

Popularity
  Based on link analysis/PageRank
  Based on click-through, other

Structuring and tagging (e.g., “best bets”)

Issues and Tricks (cont’d)

Thesaurus/query expansion
  Based on meaning, conceptual relationships
  Based on decomposition/type

User feedback/“More like this”

Clustering/grouping of results
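
Thesaurus-based query expansion can be sketched as a lookup that adds related terms to the user's query before it reaches the engine. The thesaurus entries and the function name below are hypothetical; a real deployment would draw on a curated vocabulary.

```python
# Hypothetical thesaurus: each term maps to terms treated as equivalent or related.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "laptop": ["notebook"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms followed by any thesaurus expansions, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("used car prices"))
```

Expansion tends to raise recall at some cost in precision, so it is worth testing against real queries before turning it on for every search.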

Morphological Variation

Handling morphology: related concepts have different forms
  Inflectional morphology: same part of speech
    dogs = dog + PLURAL
    broke = break + PAST
  Derivational morphology: different parts of speech
    destruction = destroy + ion
    researcher = research + er

Different morphological processes:
  Prefixing
  Suffixing
  Infixing
  Reduplication

Stemming

Dealing with morphological variation: index stems instead of words
  Stem: a word equivalence class that preserves the central concept

How much to stem?
  organization → organize → organ?
  resubmission → resubmit/submission → submit?
  reconstructionism?
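
The “how much to stem?” question can be made concrete with a deliberately crude suffix rewriter. The rules below are hypothetical and chosen only to reproduce the slide's examples; this is not the Porter stemmer or any other published algorithm.

```python
# Deliberately crude, hypothetical rewrite rules (suffix, replacement); not a real stemmer.
RULES = [("ization", "ize"), ("ission", "it"), ("ize", ""), ("s", "")]

def stem(word, max_steps):
    """Apply at most max_steps suffix rewrites; more steps means more aggressive stemming."""
    for _ in range(max_steps):
        for suffix, replacement in RULES:
            # Only rewrite if at least three characters of the word would remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: len(word) - len(suffix)] + replacement
                break
        else:
            return word  # no rule applied, so nothing more can be stripped
    return word

for steps in (0, 1, 2):
    print(steps, stem("organization", steps), stem("organize", steps))
```

With two steps, "organization" collapses all the way to "organ", so a search for one would also match documents about the other. That is the over-stemming risk the slide is pointing at.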

Does Stemming Work?

Generally, yes! (in English)
  Helps more for longer queries, fewer results
  Lots of work done in this area

But used very sparingly in web search. Why?

Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.

Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.

David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.

And others…

Beyond Words…

Stemming/tokenization = specific instance of a general problem: what is it?

Other units of indexing
  Concepts (e.g., from WordNet)
  Named entities
  Relations
  …

Some Observations

Search engine fundamentals are very similar

There are many tricks, differences beyond the basic model

Differences show up in different ways, and are magnified as we get to sites, specific applications

So, as we get to deployment …
  Be skeptical
  Test rigorously
  Some small things can make a big difference

Deployment - Overview

What we can control

Basic process of setting up/using search in IA

Key parameters/issues
  What to search/organizing content
  Testing and improving results
  Presentation/interfaces

What we control (the IA part)?

Requirements and search engine selection
  Developing search requirements
  Build vs. buy
  Vendor evaluation/selection
  Consultants?

Content selection
  What to search/zones/etc.
  Tags

Search engine configuration
  Zones, what gets indexed, sometimes how
  Number of results, sometimes recall vs. precision
  Others (very often interface-related)

Interfaces

Search Engine Selection

Commercial examples
  Autonomy (including the former Verity, Ultraseek, . . .)
  Google (site search, search appliance)
  Thunderstone

Build your own, open source?
  Lucene

Defining requirements
  Basic search: how big, type of documents, what sort of interface, metadata, parametric?
  Advanced requirements: automatic tagging, alerts, “more like this”
  Customization and improvement using logs
  Keep it focused?

Search Engine Selection (cont’d)

Pitfalls to avoid
  “Getting a bargain”
  Getting it “free”
  Great sales reps

Good ideas
  Get case studies, talk to references
  Get a “proof of concept period”

Simple Requirements Matrix

Vendor Name

Requirement/Criterion                                                 Priority  Rating  Comments

1. Identify Early Warnings/Search
  1.a. Highly detailed information needs                              1
  1.b. Date range restrictions                                        1
  1.c. Company name restrictions                                      1
  1.d. Alias/equivalence (e.g., The Walt Disney Company = Disney)     2
  1.e. Ability to assign unique IDs (e.g., Disney = NYSE:DIS)         2
  1.f. Restrict/search by subject area/topic                          2
  1.g. Ability to partition/segment articles with multiple topics     3
  1.h. Federated search w/web content, Nexis, etc.                    2
  1.i. Use of extended lists (e.g., lists of companies, subjects)     2

2. Identify Early Warnings/Alerts
  2.a. Highly detailed information needs (all of i-h above)           1
  2.b. Controlling/weighting specific elements                        2
  2.c. Recall/precision tradeoff                                      3
  2.d. Identify "new" and "hot" articles that match user's interest   2
  2.e. Sentiment analysis component                                   3

3. Identify Early Warnings/Discovery
  3.a. Classify documents in pre-defined or user-defined categories   3
  3.b. Document clustering                                            3
  3.c. Identification of trends/issues                                2
  3.d. Other discovery tools                                          3

4. Integration and interface requirements

Content Selection (What to Search)

Generally, search everything but …
  Be leery about providing “search the web” option
  Use zones or separate text databases for frequent/infrequent information needs
  Be careful about outdated/deleted content
  Make sure “best bets” come to the top

Use logs, test & improve
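
One simple way to use logs is to count how often each query occurs and see how much traffic the most frequent ones account for. The sketch below assumes the log has already been reduced to one query string per search; the sample data and the function name are made up.

```python
from collections import Counter

def head_coverage(queries, top_n):
    """Top queries and the fraction of all logged searches they account for."""
    counts = Counter(q.strip().lower() for q in queries)  # fold case and stray spaces
    total = sum(counts.values())
    top = counts.most_common(top_n)
    share = sum(c for _, c in top) / total if total else 0.0
    return top, share

# Hypothetical log extract: one query string per search.
log = ["parking", "Parking", "library hours", "parking", "tuition",
       "library hours", "parking ", "campus map", "tuition", "parking"]

top, share = head_coverage(log, top_n=2)
print(top, share)
```

If a handful of queries covers most of the traffic, those are the first candidates for zones, “best bets”, or dedicated navigation.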

Testing and Improvement

Keep track of queries (and results, if possible) using logs
  If logs are not available, try user experiments
  If results are not available, get them
  Relevance/correctness judgments are useful; quantitative scores (e.g., recall/precision) are, too

How to improve
  Focus on most frequent (important?) requests (90-10 or 80-20)
  “Best bets”
  Content manipulation (e.g., adding tags)
  Thesaurus
  Keep testing
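
For the quantitative side, precision and recall for a single query can be computed from the returned results and a set of relevance judgments. The page ids below are hypothetical, and the sketch ignores ranking order.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given result ids and judged-relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # share of results that are relevant
    recall = hits / len(relevant) if relevant else 0.0       # share of relevant pages that were found
    return precision, recall

# Hypothetical judgments: the engine returned four pages; three pages are truly relevant.
p, r = precision_recall(["p1", "p2", "p3", "p4"], ["p2", "p4", "p9"])
print(p, r)
```

Tracking these two numbers for the most frequent queries before and after each change gives a basic regression test for search quality.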

“Best Bets” – How to Implement

Identify desired result page

Determine possible query strings (from logs)

Tag meta-data in documents with query string

Configure search interface (e.g., to show Best Bets first, what to do about multiple Best Bets)

This is a special case of using a tag field (e.g., keywords, categories, description)
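
The steps above can be sketched as a small lookup layered in front of the engine's own ranking. Everything here is an assumption made for illustration: the page URLs, the `best_bet_for` metadata field, and exact matching on the normalized query string. Real engines expose this through their own configuration.

```python
# Hypothetical page metadata: each page lists the query strings it is a Best Bet for.
PAGES = {
    "/hr/benefits": {"title": "Employee Benefits", "best_bet_for": ["benefits", "health insurance"]},
    "/hr/holidays": {"title": "Holiday Calendar", "best_bet_for": ["holidays", "days off"]},
    "/news/2012-benefits-fair": {"title": "Benefits Fair Recap", "best_bet_for": []},
}

def best_bets(query, pages=PAGES):
    """Pages whose Best Bet tags exactly match the normalized query."""
    q = " ".join(query.lower().split())  # lowercase and collapse whitespace
    return [url for url, meta in pages.items() if q in meta["best_bet_for"]]

def results_with_best_bets(query, organic_results, pages=PAGES):
    """Show Best Bets first, then the engine's own results with duplicates removed."""
    bets = best_bets(query, pages)
    return bets + [url for url in organic_results if url not in bets]

organic = ["/news/2012-benefits-fair", "/hr/benefits"]  # stand-in for the engine's ranking
print(results_with_best_bets("Health  Insurance", organic))
```

If several pages are tagged for the same query, this sketch simply lists them all in metadata order; choosing how many to show is one of the interface decisions mentioned above.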

Designing a Search Interface

The Box (size, position, labels)

Content selection (defaults, radio buttons or pull-down selection)

Parameters or advanced search (Booleans, separate zones, other possibilities)

Designing a Search Interface - Results

Number of results to display

Recall/precision tradeoff?

Snippet/summary information for each hit

Layout of best bets/other hits

Repetition of the query

“No results” – other possible tips

Iteration and refinement

Other (e.g., scores, clusters, …)

Some example sites

www.hp.com
www.dell.com
www.ecoearth.info
www.washingtonpost.com
www.dailygazette.com
www.friendsofrockcreek.org
www.cbf.org
www.umd.edu

Integrating Search and Browsing

Provide more navigation for common needs
  …based on search logs, other info

Redirect from search results to navigation

Faceted browsing

. . .

Faceted Browsing Example

Demo: http://flamenco.berkeley.edu/demos.html

Advantages of Facets

Integrates searching and browsing

Easy to build complex queries

Easy to narrow, broaden, shift focus

Helps users avoid getting lost

Helps to prevent “categorization wars”
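
A minimal sketch of how faceted narrowing works: each item carries a value for every facet, a selection filters the items, and counts over the remaining items label the next choices. The catalog, facet names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical catalog; each item carries a value for every facet.
ITEMS = [
    {"name": "Laptop A", "brand": "HP", "type": "laptop", "price": "under-1000"},
    {"name": "Laptop B", "brand": "Dell", "type": "laptop", "price": "over-1000"},
    {"name": "Desktop C", "brand": "Dell", "type": "desktop", "price": "under-1000"},
    {"name": "Printer D", "brand": "HP", "type": "printer", "price": "under-1000"},
]

def apply_facets(items, selected):
    """Keep items matching every selected facet value (narrowing = adding a selection)."""
    return [it for it in items if all(it[f] == v for f, v in selected.items())]

def facet_counts(items, facets):
    """Counts per facet value among the current items, used to label the next choices."""
    return {f: Counter(it[f] for it in items) for f in facets}

current = apply_facets(ITEMS, {"brand": "Dell"})
counts = facet_counts(current, ["type", "price"])
print([it["name"] for it in current], counts)
```

Removing a key from the selection broadens the result set again, which is what makes it easy to narrow, broaden, or shift focus without starting a new search.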

Recap

Search is an IA issue!

Quality of search results/user experience depends on:
  Understanding how search engines work
  Choosing and deploying carefully
  Constant testing and improvement
  Time

Tremendous range of parameters/interface choices

Integrating search and browsing/navigation is a very good idea
