9/4/2001 Information Organization and Retrieval
Introduction to Information Retrieval
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson
Review: Information Overload
• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)
• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
Information Organization and Retrieval
• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.
• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley
Information Life Cycle
[Figure: information life cycle diagram. Labels: Creation, Utilization, Searching; Active, Semi-Active, Inactive; Retention/Mining, Disposition, Discard. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating.]
Note: This version of the Life cycle is based on the report of a conference on the Social Aspects of Digital Libraries held at UCLA. - C. Borgman, PI
Authoring/Modifying
• Converting Data+Information+Knowledge to New Information.
• Creating information from observation, thought.
• Editing and Publication.
• Gatekeeping
Organizing/Indexing
• Collecting and Integrating information.
• Affects Data, Information and Metadata.
• “Metadata” describes data and information.
  – More on this later.
• Organizing Information.
  – Types of organization?
• Indexing
Storing/Retrieving
• Information Storage
  – How and where is information stored?
• Retrieving Information.
  – How is information recovered from storage
  – How to find needed information
  – Linked with Accessing/Filtering stage
Distribution/Networking
• Transmission of information
  – How is information transmitted?
• Networks vs Broadcast.
Accessing/Filtering
• Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)
Using/Creating
• Using Information.
• Transformation of Information to Knowledge.
• Knowledge to New Data and New Information.
Key issues in this course
• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs.
  – Retrieving
• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them.
  – Organizing
Key Issues
[Figure: the information life cycle diagram, repeated from the Information Life Cycle slide.]
This Week
• Introduction to IR
  – Modern IR textbook topics
• The Information Seeking Process
Textbook Topics
More Detailed View
What We’ll Cover
[Figure legend: A Lot, A Little]
Search and Retrieval
Outline of Part I of SIMS 202
• The Search Process
• Information Retrieval Models
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
• System and Implementation Issues
• Web-Specific Issues
• User Interface Issues
• Special Kinds of Search
What is an Information Need?
The Standard Retrieval Interaction Model
Standard Model
• Assumptions:
  – Maximizing precision and recall simultaneously
  – The information need remains static
  – The value is in the resulting document set
Problem with Standard Model:
• Users learn during the search process:
  – Scanning titles of retrieved documents
  – Reading retrieved documents
  – Viewing lists of related topics/thesaurus terms
  – Navigating hyperlinks
• Some users don’t like long disorganized lists of documents
IR is an Iterative Process
[Figure labels: Repositories, Workspace, Goals]
IR is a Dialog
– The exchange doesn’t end with the first answer
– User can recognize elements of a useful answer
– Questions and understanding change as the process continues.
“Berry-Picking” as an Information Seeking Strategy (Bates 90)
• Standard IR model
  – assumes the information need remains the same throughout the search process
• Berry-picking model
  – interesting information is scattered like berries among bushes
  – the query is continually shifting
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
[Figure: a sequence of queries labeled Q0, Q1, Q2, Q3, Q4, Q5]
Berry-picking model (cont.)
• The query is continually shifting
• New information may yield new ideas and new directions
• The information need
  – is not satisfied by a single, final retrieved set
  – is satisfied by a series of selections and bits of information found along the way.
Information Seeking Behavior
• Two parts of a process:
  – search and retrieval
  – analysis and synthesis of search results
• This is a fuzzy area; we will look at several different working theories.
Search Tactics and Strategies
• Search Tactics
  – Bates 79
• Search Strategies
  – Bates 89
  – O’Day and Jeffries 93
Tactics vs. Strategies
• Tactic: short term goals and maneuvers
  – operators, actions
• Strategy: overall planning
  – link a sequence of operators together to achieve some end
Information Search Tactics (after Bates 79)
• Monitoring tactics
  – keep search on track
• Source-level tactics
  – navigate to and within sources
• Term and Search Formulation tactics
  – designing search formulation
  – selection and revision of specific terms within search formulation
Term Tactics
• Move around the thesaurus
  – superordinate, subordinate, coordinate
  – neighbor (semantic or alphabetic)
  – trace: pull out terms from information already seen as part of search (titles, etc)
  – morphological and other spelling variants
  – antonyms (contrary)
Source-level Tactics
• “Bibble”:
  – look for a pre-defined result set
  – e.g., a good link page on web
• Survey:
  – look ahead, review available options
  – e.g., don’t simply use the first term or first source that comes to mind
• Cut:
  – eliminate large proportion of search domain
  – e.g., search on rarest term first
Source-level Tactics (cont.)
• Stretch
  – use source in unintended way
  – e.g., use patents to find addresses
• Scaffold
  – take an indirect route to goal
  – e.g., when looking for references to obscure poet, look up contemporaries
• Cleave
  – binary search in an ordered file
Monitoring Tactics (strategy-level)
• Check
  – compare original goal with current state
• Weigh
  – make a cost/benefit analysis of current or anticipated actions
• Pattern
  – recognize common strategies
• Correct Errors
• Record
  – keep track of (incomplete) paths
Additional Considerations (Bates 79)
• Add a Sort tactic!
• More detail is needed about short-term cost/benefit decision rule strategies
• When to stop?
  – How to judge when enough information has been gathered?
  – How to decide when to give up an unsuccessful search?
  – When to stop searching in one source and move to another?
Implications
• Interfaces should make it easy to store intermediate results
• Interfaces should make it easy to follow trails with unanticipated results
• Makes evaluation more difficult.
• Later in the course:
  – More on Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems
Restricted Form of the IR Problem
• The system has available only pre-existing, “canned” text passages.
• Its response is limited to selecting from these passages and presenting them to the user.
• It must select, say, 10 or 20 passages out of millions or billions!
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries.
• This set of assumptions underlies the field of Information Retrieval.
Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (‘60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (‘70s)
  • Refinements and Advances in application (‘80s)
  • User Interfaces, Large-scale testing and application (‘90s)
Structure of an IR System
[Figure: Information Storage and Retrieval System.
Search line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store1: Profiles/Search requests.
Storage line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store2: Document representations.
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).
Comparison/Matching of Store1 and Store2 -> Potentially Relevant Documents.]
Adapted from Soergel, p. 19
Relevance (introduction)
• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    • Who is buried in Grant’s tomb? Grant.
  – Partially answer question.
    • Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    • What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...
Query Languages
• A way to express the question (information need)
• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
    • Cat x x x x
    • Dog x x x x x
    • Collar x x x x
    • Leash x x x x
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
• Cat x x
• Dog x x
• Collar x x
• Leash x x
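The working and non-working combinations for (Cat OR Dog) AND (Collar OR Leash) can be regenerated by enumerating every presence/absence pattern of the four terms and evaluating the query on each. The following is a minimal sketch; the function name `matches` and the representation of a document as a set of terms are illustrative assumptions, not part of the lecture.

```python
from itertools import product

TERMS = ["cat", "dog", "collar", "leash"]

def matches(doc_terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) on a set of terms."""
    return (("cat" in doc_terms or "dog" in doc_terms)
            and ("collar" in doc_terms or "leash" in doc_terms))

# Enumerate all 2**4 = 16 presence/absence combinations of the four terms.
working, failing = [], []
for flags in product([False, True], repeat=len(TERMS)):
    doc = {t for t, present in zip(TERMS, flags) if present}
    (working if matches(doc) else failing).append(doc)

print(len(working), "combinations satisfy the query")
print(len(failing), "combinations do not (including the empty document)")
```

A document satisfies the query only when it has at least one term from each parenthesized group, which is why a document containing both Cat and Dog but neither Collar nor Leash still fails.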
Boolean Logic
[Figure: Venn diagrams of sets A and B, with result set C]
DeMorgan's Law:
$\overline{A \cap B} = \overline{A} \cup \overline{B}$
$\overline{A \cup B} = \overline{A} \cap \overline{B}$
Boolean Queries
– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
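The three rewriting rules can be checked mechanically by trying every truth assignment. This is a small sketch in Python; the helper name `check` is an illustrative assumption.

```python
from itertools import product

def check(rule):
    """Return True if rule(a, b) holds for every truth assignment of a and b."""
    return all(rule(a, b) for a, b in product([False, True], repeat=2))

# NOT(a) AND NOT(b) = NOT(a OR b)
de_morgan_and = check(lambda a, b: ((not a) and (not b)) == (not (a or b)))
# NOT(a) OR NOT(b) = NOT(a AND b)
de_morgan_or = check(lambda a, b: ((not a) or (not b)) == (not (a and b)))
# NOT(NOT(a)) = a
double_negation = check(lambda a, b: (not (not a)) == a)

print(de_morgan_and, de_morgan_or, double_negation)
```

Rules like these let a system rewrite a query into an equivalent form, for example pushing NOT down to individual terms before evaluation.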
Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3 over documents D1 through D11. The regions m1 through m8 are the eight minterms, each a conjunction of t1, t2, and t3 in which every term appears either plain or negated.]
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure: Venn diagram of the four concepts Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
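The relaxed query is an OR over every three-concept conjunction, which is the same as requiring at least three of the four concepts. The sketch below checks that equivalence over all sixteen term combinations; the function names and the set-of-terms document model are illustrative assumptions.

```python
from itertools import combinations, product

CONCEPTS = ["cracks", "beams", "width_measurement", "prestressed_concrete"]

def formal_query(doc_terms):
    """All four concepts must be present."""
    return all(c in doc_terms for c in CONCEPTS)

def relaxed_query(doc_terms):
    """OR over every 3-concept conjunction, as in the relaxed query."""
    return any(all(c in doc_terms for c in triple)
               for triple in combinations(CONCEPTS, 3))

def at_least_three(doc_terms):
    """Count-based formulation of the same condition."""
    return sum(c in doc_terms for c in CONCEPTS) >= 3

# Compare the two formulations on every possible combination of concepts.
all_docs = [{c for c, f in zip(CONCEPTS, flags) if f}
            for flags in product([False, True], repeat=len(CONCEPTS))]
equivalent = all(relaxed_query(d) == at_least_three(d) for d in all_docs)
print("relaxed query == at least 3 of 4:", equivalent)
```

Written as a count, the relaxation generalizes easily (at least k of n concepts), whereas the explicit Boolean form grows quickly as concepts are added.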
Pseudo-Boolean Queries
• A new notation, from web search
  – +cat dog +collar leash
• Does not mean the same thing!
• Need a way to group combinations.
• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
[Figure: retrieval system diagram with components Information need, text input, Parse, Query, Rank, Index, Pre-process, Collections]
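To see why the "+" notation differs from a Boolean query, here is a sketch under one common web-search convention, which the lecture does not spell out and is assumed here: a leading "+" marks a required term, and unmarked terms are optional and only raise the score. The names `parse_plus_query`, `contains`, and `score` are hypothetical.

```python
import re

# An optional '+', then either a double-quoted phrase or a bare word.
TOKEN = re.compile(r'(\+?)(?:"([^"]+)"|(\S+))')

def parse_plus_query(query):
    """Split a query into (required, optional) lists of terms or phrases."""
    required, optional = [], []
    for plus, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        (required if plus else optional).append(term)
    return required, optional

def contains(text, term):
    """True if the word or multi-word phrase occurs in the text."""
    words = text.lower().split()
    target = term.split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def score(text, required, optional):
    """None if a required term is missing, else the count of optional terms present."""
    if not all(contains(text, t) for t in required):
        return None
    return sum(contains(text, t) for t in optional)

required, optional = parse_plus_query("+cat dog +collar leash")
print(required, optional)
print(score("dog on a leash", required, optional))
```

Under this assumed reading, a document mentioning only "dog" and "leash" is rejected by `+cat dog +collar leash`, although it satisfies (Cat OR Dog) AND (Collar OR Leash).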
Result Sets
• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  – (Redford AND Newman) -> S1 1450 documents
  – (S1 AND Sundance) -> S2 898 documents
[Figure: retrieval system diagram with components Information need, text input, Parse, Query, Reformulated Query, Rank, Re-Rank, Index, Pre-process, Collections]
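Running a reformulated query on a result set amounts to intersecting the saved set with the new term's postings. The sketch below uses a toy inverted index whose document ids are invented for illustration; they are not the Dialog counts quoted in the example.

```python
# Toy inverted index: term -> set of document ids (made-up data).
index = {
    "redford": {1, 2, 3, 5},
    "newman": {2, 3, 5, 8},
    "sundance": {3, 5, 9},
}

# S1 = Redford AND Newman, evaluated against the whole collection.
s1 = index["redford"] & index["newman"]

# S2 = S1 AND Sundance, evaluated against the saved result set S1.
s2 = s1 & index["sundance"]

# The same refinement written as one query over the entire collection.
s2_full = index["redford"] & index["newman"] & index["sundance"]

print(sorted(s1), sorted(s2), s2 == s2_full)
```

For pure AND refinement the two choices give the same documents, and reusing S1 only saves work. They differ when the reformulation broadens the query, for example by adding an OR term, because documents outside S1 can then only be found by searching the entire collection.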
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
  – p-norm is most famous
    • usually impractical to implement
    • usually hard for user to understand
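The question of one hit on each term versus many hits on one term can be made concrete with two toy scoring functions. This is an illustrative sketch with invented documents; neither function is presented as what any particular system does.

```python
from collections import Counter

def total_hits(text, query_terms):
    """Total number of occurrences of any query term in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)

def distinct_terms(text, query_terms):
    """Number of different query terms that occur at least once."""
    words = set(text.lower().split())
    return sum(t in words for t in query_terms)

docs = {"d1": "cat cat cat cat", "d2": "cat dog"}  # made-up documents
query = ["cat", "dog"]

by_hits = sorted(docs, key=lambda d: total_hits(docs[d], query), reverse=True)
by_distinct = sorted(docs, key=lambda d: distinct_terms(docs[d], query), reverse=True)
print(by_hits, by_distinct)
```

Ordering by total hits puts the document that repeats one term first, while ordering by distinct terms prefers the document that mentions both, so the two criteria can disagree on the same result set.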
Boolean
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial systems until the WWW
Faceted Boolean Query
• Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      (a1 OR a2 OR a3) AND
      (b1 OR b2) AND
      (c1 OR c2 OR c3 OR c4)
  – each facet expresses a topic
      (“rain forest” OR jungle OR amazon) AND
      (medicine OR remedy OR cure) AND
      (Smith OR Zhou)
Faceted Boolean Query
• Query still fails if one facet missing
• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match
• Problem: Facets still undifferentiated
• Alternative: assign weights to facets
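Coordination level ranking, with optional facet weights, can be sketched as follows. The documents, the weights, and the name `facet_score` are assumptions for illustration, and phrases such as "rain forest" are treated as single index terms.

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def facet_score(doc_terms, facets=FACETS, weights=None):
    """Sum the weights of the facets that have at least one term in the document.

    With no weights, every facet counts 1, which gives the coordination level.
    """
    weights = weights or [1.0] * len(facets)
    return sum(w for facet, w in zip(facets, weights) if facet & doc_terms)

docs = {  # made-up documents, each reduced to its set of index terms
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"jungle", "amazon", "rain forest"},
}

# Coordination level: d1 satisfies 3 facets, d2 two, d3 and d4 one each.
ranked = sorted(docs, key=lambda d: facet_score(docs[d]), reverse=True)

# Weighting the author facet more heavily moves d3 ahead of d2.
weighted = sorted(docs, key=lambda d: facet_score(docs[d], weights=[1, 1, 3]),
                  reverse=True)
print(ranked, weighted)
```

Note that d4 matches three terms but all within one facet, so it scores no higher than d3; counting facets rather than terms is what distinguishes this from ordering by total hits.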
Proximity Searches
• Proximity: terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
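A proximity operator such as `pen w/5 paper` can be approximated by comparing word positions. The sketch below assumes whitespace tokenization and counts distance as the difference in word positions; real systems define the window in their own ways, and the names `positions` and `within` are hypothetical.

```python
def positions(words, term):
    """Indexes at which the term occurs in the token list."""
    return [i for i, w in enumerate(words) if w == term]

def within(text, a, b, k, ordered=False):
    """True if terms a and b occur within k word positions of each other.

    With ordered=True, a must come before b.
    """
    words = text.lower().split()
    for i in positions(words, a):
        for j in positions(words, b):
            gap = j - i
            if ordered and gap < 0:
                continue
            if 0 < abs(gap) <= k:
                return True
    return False

text = "the pen lay on the desk beside the paper"
print(within(text, "pen", "paper", 5))
print(within(text, "pen", "paper", 7))
print(within("paper and pen", "pen", "paper", 5, ordered=True))
```

A phrase query is the special case `ordered=True` with `k=1`, and a phrase variant such as "retrieval of information" versus "information retrieval" can be caught by an unordered window of two positions.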
Filters
• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu .com .berkeley.edu)
    • author
    • size
    • limit number of documents returned
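Metadata filters like those listed above can be applied as simple predicates over a candidate set, before or after the query is matched. This is a minimal sketch; the `Doc` fields, the `apply_filters` signature, and the sample records are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    host: str
    author: str
    published: date
    size_bytes: int

def apply_filters(docs, start=None, end=None, domain=None,
                  author=None, max_size=None, limit=None):
    """Keep documents whose metadata passes every filter that is set."""
    kept = []
    for d in docs:
        if start is not None and d.published < start:
            continue
        if end is not None and d.published > end:
            continue
        if domain is not None and not d.host.endswith(domain):
            continue
        if author is not None and d.author != author:
            continue
        if max_size is not None and d.size_bytes > max_size:
            continue
        kept.append(d)
    # The limit on the number of documents returned is applied last.
    return kept if limit is None else kept[:limit]

candidates = [  # made-up records
    Doc("lib.example.edu", "author_a", date(2001, 9, 4), 40_000),
    Doc("www.example.com", "author_b", date(1999, 1, 15), 5_000),
    Doc("www.other.edu", "author_c", date(1975, 6, 1), 90_000),
]
edu_docs = apply_filters(candidates, domain=".edu")
recent_edu = apply_filters(candidates, domain=".edu", start=date(2000, 1, 1))
print([d.host for d in edu_docs], [d.host for d in recent_edu])
```

Because a suffix match is used for the domain, a broad filter such as `.edu` and a narrow one such as `.example.edu` are handled by the same test.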
Next
• Statistical Properties of Text
• Preparing information for search: Lexical analysis
• Introduction to the Vector Space model of IR.