SIMS 202 Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000

Today
Introductions
Course Overview
Administrivia

Goals of the Course
Learn about:
  – Design, development and use of information storage and retrieval systems
  – Practical and theoretical foundations of information organization and analysis
  – Evaluation of information access systems
  – Cognitive and user-centric considerations
  – Hands-on experience with information systems

Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process

Web Search Questions
What do people search for?
How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?
How can search engines be improved?

What Do People Search for on the Web?
Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys

What Do People Search for on the Web?
Topics
    » Genealogy/Public Figure: 12%
    » Computer related: 12%
    » Business: 12%
    » Entertainment: 8%
    » Medical: 8%
    » Politics & Government: 7%
    » News: 7%
    » Hobbies: 6%
    » General info/surfing: 6%
    » Science: 6%
    » Travel: 5%
    » Arts/education/shopping/images: 14%
Something is missing…

What Do People Search for on the Web?
50,000 queries from Excite, 1997
Most frequent terms:
  4660 sex
  3129 yahoo
  2191 internal site admin check from kho
  1520 chat
  1498 porn
  1315 horoscopes
  1284 pokemon
  1283 SiteScope test
  1223 hotmail
  1163 games
  1151 mp3
  1140 weather
  1127 www.yahoo.com
  1110 maps
  1036 yahoo.com
  983 ebay
  980 recipes

Why do these differ?
Self-reporting survey
The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    » UFO, Flying Saucer, Space Ship, Satellite
    » How many ways are there to talk about history?

Intranet Queries (Aug 2000)
  3351 bearfacts
  3349 telebears
  1909 extension
  1874 schedule+of+classes
  1780 bearlink
  1737 bear+facts
  1468 decal
  1443 infobears
  1227 calendar
  989 career+center
  974 campus+map
  920 academic+calendar
  840 map
  773 bookstore
  741 class+pass
  738 housing
  721 tele-bears
  716 directory
  667 schedule
  627 recipes
  602 transcripts
  582 tuition
  577 seti
  563 registrar
  550 info+bears
  543 class+schedule
  470 financial+aid

Intranet Queries
Summary of sample data from 3 weeks of UCB queries
  – 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  – 6.7% Schedule of classes or final exams (6222)
  – 5.4% Summer Session (5041)
  – 3.2% Extension (2932)
  – 3.1% Academic Calendar (2846)
  – 2.4% Directories (2202)
  – 1.7% Career Center (1588)
  – 1.7% Housing (1583)
  – 1.5% Map (1393)
Average query length over last 4 months: 1.8 words
This suggests what is difficult to find from the home page

An Example Search System: Cha-Cha
A system for searching complex intranets
Places retrieval results in context

An Example Search System: Cha-Cha
Important design goals:
  – Users at any level of computer expertise
  – Browsers at any version level
  – Computers of any speed

Search: Where to Start?
Guess words?
  – Search engine plunges you into the middle of a site/collection
  – Too many or too few results
  – No context
Use a directory?
  – If large, may be difficult/frustrating to navigate
  – Several ways to organize the information
  – May not reflect users’ needs
Solution: Integrate Browsing and Search

How Cha-Cha Works
Crawl entire Intranet
Compute the shortest hyperlink path from a certain root page to every web page
Index and compute metadata for the pages
  – Using Cheshire II (by Ray Larson)
Run a user query.
  – Gather all the hits
  – Create a “directory” based on combining the shortest paths
  – Special graph algorithm removes redundant links and internal nodes
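
The shortest-path and "directory" steps above can be illustrated with a minimal sketch. It assumes the crawl is available as a dictionary mapping each page to the pages it links to, and it uses plain breadth-first search. The slides do not say which algorithm Cha-Cha uses, and the page names and the helper names `shortest_paths`, `path_to` and `build_directory` are hypothetical. The step that removes redundant links and internal nodes is not attempted here.

```python
from collections import deque


def shortest_paths(links, root):
    """Breadth-first search over the hyperlink graph.

    Returns a dict mapping every reachable page to its parent on one
    shortest path from root (the root itself maps to None).
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:  # first visit is a shortest path
                parent[target] = page
                queue.append(target)
    return parent


def path_to(parent, page):
    """Walk parent pointers back to the root; return the path root-first."""
    path = []
    while page is not None:
        path.append(page)
        page = parent[page]
    return path[::-1]


def build_directory(parent, hits):
    """Merge the shortest paths of all hits into one nested dict."""
    tree = {}
    for hit in hits:
        if hit not in parent:
            continue  # page was not reachable from the root
        node = tree
        for page in path_to(parent, hit):
            node = node.setdefault(page, {})
    return tree


# Toy link graph (made-up page names).
links = {
    "home": ["depts", "people"],
    "depts": ["sims", "cs"],
    "people": ["hearst"],
    "sims": ["hearst", "courses"],
}
parent = shortest_paths(links, "home")
directory = build_directory(parent, ["hearst", "courses"])
```

Hits that share a prefix of their shortest paths end up under the same branch of `directory`, which is the sense in which search results are shown "in context".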

Cha-Cha System Architecture
[Diagram: crawl the web; store the documents]

Cha-Cha System Architecture
[Diagram: crawl the web; store the documents; create files of metadata; Cheshire II]

Cha-Cha Metadata
Information about web pages
  – Title
  – Length
  – Inlinks
  – Outlinks
  – Shortest Paths from a root home page
Used to provide innovative search interface

Cha-Cha System Architecture
[Diagram: crawl the web; create a keyword index; store the documents; create files of metadata; Cheshire II]

Creating a Keyword Index
For each document
  – Tokenize the document
    » Break it up into tokens: words, stems, punctuation
    » There are many variations on this
  – Record which tokens occurred in this document
    » Called an Inverted Index
    » Dictionary: a record of all the tokens in the collection and their overall frequency
    » Postings File: a list recording, for each token, which documents it occurs in and how often it occurs
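
As a rough illustration of the dictionary and postings file described above, here is a small sketch. The tokenizer is only one of the many possible variations (lowercased runs of letters and digits, no stemming), and the document ids, sample texts, and function names are made up for the example. It is not how Cheshire II builds its index.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """One simple tokenization: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a dictionary and a postings file from {doc_id: text}."""
    dictionary = Counter()        # token -> overall frequency in the collection
    postings = defaultdict(dict)  # token -> {doc_id: frequency in that document}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        for token, n in counts.items():
            dictionary[token] += n
            postings[token][doc_id] = n
    return dictionary, postings


# Two tiny made-up documents.
docs = {
    "d1": "Pam Samuelson teaches at SIMS",
    "d2": "SIMS courses: SIMS 202",
}
dictionary, postings = build_index(docs)
```

Here `dictionary["sims"]` holds the collection-wide count and `postings["sims"]` lists each document containing the token together with its count there.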

Cha-Cha System Architecture
[Diagram: Cheshire II; user query]

Responding to the User Query
User searches on “pam samuelson”
Search Engine looks up documents indexed with one or both terms in its inverted index
Search Engine looks up titles and shortest paths in the metadata index
User Interface combines the information and presents the results as HTML
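
The lookup steps above might look roughly like the following sketch. The postings and metadata tables are tiny hand-made stand-ins with invented page ids, titles and paths. Ordering hits by how many query terms they match is only a placeholder assumption, since the slides do not describe how documents are ranked.

```python
# Hypothetical inverted index: term -> {page_id: frequency}.
postings = {
    "pam": {"p1": 2, "p3": 1},
    "samuelson": {"p1": 1, "p2": 1},
}

# Hypothetical metadata index: page_id -> title and shortest path from the root.
metadata = {
    "p1": {"title": "Faculty profile", "path": ["home", "faculty", "p1"]},
    "p2": {"title": "Clinic page", "path": ["home", "law", "p2"]},
    "p3": {"title": "Personal page", "path": ["home", "people", "p3"]},
}


def search(query, postings, metadata):
    """Return (page_id, title, path) for pages matching one or both terms."""
    terms = set(query.lower().split())
    matched = {}  # page_id -> number of distinct query terms it contains
    for term in terms:
        for page_id in postings.get(term, {}):
            matched[page_id] = matched.get(page_id, 0) + 1
    # Placeholder ordering: more matching terms first, then page id.
    ranked = sorted(matched, key=lambda p: (-matched[p], p))
    return [(p, metadata[p]["title"], metadata[p]["path"]) for p in ranked]


results = search("pam samuelson", postings, metadata)
```

A user interface layer would then group `results` by their paths and render them as HTML.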

Cha-Cha System Architecture
[Diagram: Cheshire II; server accesses the databases]

Cha-Cha System Architecture
[Diagram: Cheshire II; results shown to user]

Cha-Cha System Architecture
[Diagram: Cheshire II; user query; server accesses the databases; results shown to user]

What hasn’t been explained here?
How documents are ranked
How queries are formed
How shortest paths are computed
How the system is built
  – … among other things!
  – This is just an introduction! Much more later.

Course Schedule
Retrieval
  – The Search Process
  – Content Analysis
    » Tokenization, Zipf’s Law, Lexical Associations
  – IR Implementation
  – Term weighting and document ranking
    » Vector space model
    » Probabilistic model
  – User Interfaces
    » Overviews, query specification, providing context, relevance feedback

Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process

Overview Example
Web site design
  – Incorporates many of the organizational issues we will be covering
  – Example taken from a study of professional designers, by Mark Newman

Web Site Design
(Adapted from slide by Mark Newman)
Information design
  – structure, categories of information
Navigation design
  – interaction with information structure
Graphic design
  – visual presentation of information and navigation (color, typography, etc.)

Design Specialties
(Adapted from slide by Mark Newman)
Information Architecture
  – includes management and more responsibility for content
User Interface Design
  – includes testing and evaluation

Web Site Design Process
(Adapted from slide by Mark Newman)
[Diagram of process stages: Needs Assessment; Conceptualization; Preliminary Design; Design; Implementation]

Design Process: Preliminary Design
(Adapted from slide by Mark Newman)
(information/navigation design: schematic)

Design Process: Preliminary Design
(Adapted from slide by Mark Newman)
(navigation design: storyboard)

Web Site Design Process
Major design activities are:
  – Deciding on a set of categories that define the information content
  – Deciding how to represent these
  – Deciding on the navigation structure through the categorized content
    » Example: a movie listing website
There are similarities and differences to:
  – Database design
  – Thesaurus design

Course Schedule
Organization
  – Overview
  – Metadata and Markup
  – Controlled Vocabularies, Classification, Thesauri
  – Information Design
    » Thesaurus Design
    » Database Design

Assignments and Exams
Approximately 9 short assignments (due within one week to ten days)
  – Sometimes “checked”, sometimes graded
One Midterm
  – Might be a project, might be an exam (TBA)
Final exam Monday Dec 11
Grading:
  – Assignments: 40%
    » Not evenly weighted
  – Final: 25%
  – Midterm: 25%
  – Class Participation: 10%

Readings
Course Reader
  – Will be available in about a week at Ned’s (on Bancroft, across from the ASUC)
Textbooks
  – Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
  – The Organization of Information, Arlene G. Taylor, Libraries Unlimited, 1999

Homework (!)
Read the handout (Borges and Dennett)
Write one or two paragraphs on
  – What is information, according to your background or area of expertise?
Due in class this Thursday, Aug 31.

What is Information?
There is no “correct” definition
Can involve philosophy, psychology, signal processing, physics
Cookie Monster’s definition:
  – “news or facts about something”
Oxford English Dictionary
  – information: informing, telling; thing told, knowledge, items of knowledge, news
  – knowledge: knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known

Next Time
More on What is Information?