2004.08.31 - SLIDE 1IS 202 - Fall 2004
Course Introduction
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2004
SIMS 202: Information Organization and Retrieval
Credits to Marti Hearst for some of the slides in this lecture
2004.08.31 - SLIDE 4IS 202 - Fall 2004
IS202 Teaching Team
Professor Ray Larson
Professor Marc Davis
TA Allison Billings
TA Tran Tu
2004.08.31 - SLIDE 5IS 202 - Fall 2004
Who Am I?
• Professor and Associate Dean at SIMS
• Here from the founding of SIMS, faculty member of the “previous school”
2004.08.31 - SLIDE 6IS 202 - Fall 2004
What Do I Do?
• Research
  – Design, development and evaluation of information retrieval systems and digital libraries
  – Cheshire II and III
  – Bibliometrics of the WWW
  – Geographic information retrieval (GIR)
  – XML Retrieval
  – Applications of Grid computing to (large-scale) IR and Digital Libraries
  – Distributed search and retrieval
• Teaching
  – Information Retrieval
  – Database Management
2004.08.31 - SLIDE 7IS 202 - Fall 2004
Who Am I?
• Assistant Professor at SIMS (School of Information Management and Systems)
• Background
    1980 – 1984 B.A. from Wesleyan University in the College of Letters
    1984 – 1987 M.A. from the University of Konstanz in Literary Theory and Philosophy
    1990 – 1995 Ph.D. from MIT Media Laboratory in Media Arts and Sciences
    1993 – 1998 Member of the Research Staff and Project Coordinator at Interval Research Corporation
    1999 – 2002 Chairman and CTO of Amova
2004.08.31 - SLIDE 8IS 202 - Fall 2004
What Do I Do?
• Create technology and applications that will enable daily media consumers to become daily media producers
• Research and teaching in the theory, design, and development of digital media systems for creating and using media metadata to automate media production and reuse
  – Research
    • Director of Garage Cinema Research
    • Projects in Media Metadata, Active Capture, Adaptive Media, Mobile Media Metadata, and Social Uses of Personal Media
    • Executive Committee Member and Co-Founder of the Center for New Media
    • Affiliated Faculty Member of the Berkeley Institute of Design
  – Teaching
    • Multimedia Information
    • Digital Media Design Studio
    • Foundations of New Media
2004.08.31 - SLIDE 9IS 202 - Fall 2004
Student Introductions
• Who are you?
  – Name
  – Undergrad degree
  – Special areas of expertise and interest
• Why are you here?
  – What you want to learn from the course
2004.08.31 - SLIDE 11IS 202 - Fall 2004
Goals of the Course
• Learn about
  – Design, development, and use of information organization and retrieval systems
  – Practical and theoretical foundations of information organization and analysis
  – Evaluation of information access systems
  – Cognitive and user-centric considerations
  – Hands-on experience with information systems
2004.08.31 - SLIDE 12IS 202 - Fall 2004
Two Main Themes
• Information Organization and Design
• Information Retrieval and the Search Process
2004.08.31 - SLIDE 13IS 202 - Fall 2004
Information Organization and Retrieval
• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.
• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley
2004.08.31 - SLIDE 14IS 202 - Fall 2004
(Approximate) Course Schedule
• Organization
  – Phone Project Introduction
  – Categorization
  – Knowledge Representation
  – Lexical Relations and WordNet
  – Metadata Introduction
  – Controlled Vocabularies Introduction
  – Facetted Classification
  – Thesaurus Design and Construction
  – Semantic Web
  – Multimedia Information Organization and Retrieval
  – Metadata for Media
  – Phone Project Presentations
• Retrieval
  – Overview
  – Introduction to the Search Process
  – Boolean Queries and Text Processing
  – Web Search Issues and Architecture
  – Statistical Properties of Text and Vector Representation
  – Probabilistic Ranking & Relevance Feedback
  – Evaluation
  – Interfaces for Information Retrieval
  – Database Design
2004.08.31 - SLIDE 15IS 202 - Fall 2004
Information Properties
• Information can be communicated electronically
  – Broadcasting
  – Networking
• Information can be easily duplicated and shared
  – Problems of ownership
  – Problems of control
Adapted from ‘Silicon Dreams’ by Robert W. Lucky
2004.08.31 - SLIDE 17IS 202 - Fall 2004
Information Hierarchy
• Data
  – The raw material of information
• Information
  – Data organized and presented by someone
• Knowledge
  – Information read, heard, or seen and understood
• Wisdom
  – Distilled and integrated knowledge and understanding
2004.08.31 - SLIDE 18IS 202 - Fall 2004
Information
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
-- T.S. Eliot, “The Rock”

Where is the information we have lost in data?
2004.08.31 - SLIDE 19IS 202 - Fall 2004
Information Life Cycle
[Diagram: the information life cycle, with the stages Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, and Using/Creating. Other labels: Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard.]
2004.08.31 - SLIDE 20IS 202 - Fall 2004
Authoring/Modifying
• Converting data+information+knowledge to new information
• Creating information from observation, thought
• Editing and publication
• Gatekeeping
2004.08.31 - SLIDE 21IS 202 - Fall 2004
Organizing/Indexing
• Collecting and integrating information
• Affects data, information, and metadata
• “Metadata” describes data and information
  – More on this later
• Organizing information
  – Types of organization?
• Indexing
2004.08.31 - SLIDE 22IS 202 - Fall 2004
Storing/Retrieving
• Information storage
  – How and where is information stored?
• Retrieving information
  – How is information recovered from storage?
  – How do we find needed information?
  – Linked with accessing/filtering stage
2004.08.31 - SLIDE 23IS 202 - Fall 2004
Distribution/Networking
• Transmission of information
– How is information transmitted?
• Networks vs. broadcast
2004.08.31 - SLIDE 24IS 202 - Fall 2004
Accessing/Filtering
• Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)
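
The three steps above (select, locate, retrieve) can be sketched with a toy inverted index. This is only an illustration under stated assumptions: the `storage` dictionary, the document ids, and the functions `build_index` and `access` are hypothetical, and requiring every query term to match is one simple selection rule among many.

```python
from collections import defaultdict

# Hypothetical storage: document id -> text (stands in for any storage location).
storage = {
    "doc1": "campus map and directory",
    "doc2": "schedule of classes",
    "doc3": "campus housing directory",
}

def build_index(docs):
    """Organizing/indexing stage: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def access(index, docs, query):
    """Select documents matching every query term, locate them by id, retrieve their text."""
    terms = query.lower().split()
    if not terms:
        return {}
    # Select: intersect the posting sets of all query terms.
    selected = set.intersection(*(index.get(t, set()) for t in terms))
    # Locate and retrieve: look each selected id up in storage.
    return {doc_id: docs[doc_id] for doc_id in sorted(selected)}

index = build_index(storage)
print(access(index, storage, "campus directory"))
```

The index is built once in the organizing stage and reused for every query, which is why the quality of that stage bounds what accessing/filtering can find.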
2004.08.31 - SLIDE 25IS 202 - Fall 2004
Using/Creating
• Using information
• Transformation of information to knowledge
• Knowledge to new data and new information
2004.08.31 - SLIDE 26IS 202 - Fall 2004
Key Issues in This Course
• How to find the appropriate information resources for someone’s (or your own) needs
  – Retrieving
• How to describe information resources so that they may be effectively used by those who need them
  – Organizing
2004.08.31 - SLIDE 27IS 202 - Fall 2004
Key Issues
[Diagram: the information life cycle, with the stages Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, and Using/Creating. Other labels: Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard.]
2004.08.31 - SLIDE 29IS 202 - Fall 2004
Web Search Questions
• What do people search for?
• How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?
• How can search engines be improved?
2004.08.31 - SLIDE 30IS 202 - Fall 2004
What Do People Search for on the Web?
• Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys
2004.08.31 - SLIDE 31IS 202 - Fall 2004
What Do People Search for on the Web?
• Topics
  – Genealogy/Public Figure: 12%
  – Computer related: 12%
  – Business: 12%
  – Entertainment: 8%
  – Medical: 8%
  – Politics & Government: 7%
  – News: 7%
  – Hobbies: 6%
  – General info/surfing: 6%
  – Science: 6%
  – Travel: 5%
  – Arts/education/shopping/images: 14%
• Something is missing…
2004.08.31 - SLIDE 32IS 202 - Fall 2004
What Do People Search for on the Web?
Most frequent terms (50,000 queries from Excite, 1997):
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
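
A ranked list like this one can be produced by tallying the query strings in a log. The sketch below is a minimal illustration, not the procedure used for the Excite sample: the function `top_queries` and the sample log are hypothetical.

```python
from collections import Counter

def top_queries(log_lines, n=5):
    """Return the n most frequent queries as (query, count) pairs."""
    # Normalize case and surrounding whitespace so "Yahoo " and "yahoo" match.
    normalized = (line.strip().lower() for line in log_lines)
    # Skip empty lines, then tally what is left.
    counts = Counter(q for q in normalized if q)
    return counts.most_common(n)

# Hypothetical sample log, one query per line (not real Excite data).
sample_log = ["yahoo", "weather", "Yahoo", "maps", "weather", "yahoo ", ""]
for query, count in top_queries(sample_log, n=3):
    print(count, query)
```

Note that raw counts of this kind also pick up automated traffic, which is presumably how entries such as "SiteScope test" end up in the list.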
2004.08.31 - SLIDE 33IS 202 - Fall 2004
Why Do These Differ?
• Self-reporting survey
• The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    • UFO, flying saucer, space ship, satellite
    • How many ways are there to talk about history?
2004.08.31 - SLIDE 34IS 202 - Fall 2004
What is on the Web?
• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information
Source: http://elib.cs.berkeley.edu/docfreq/index.html
2004.08.31 - SLIDE 35IS 202 - Fall 2004
Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid
2004.08.31 - SLIDE 36IS 202 - Fall 2004
Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
  – 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  – 6.7% Schedule of classes or final exams (6222)
  – 5.4% Summer Session (5041)
  – 3.2% Extension (2932)
  – 3.1% Academic Calendar (2846)
  – 2.4% Directories (2202)
  – 1.7% Career Center (1588)
  – 1.7% Housing (1583)
  – 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
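
An average query length such as the 1.8 words quoted above can be computed from a table of query counts. The following is a rough sketch, assuming that "+" in the logged queries stands for a space between words; the helper names are hypothetical, and the three sample entries are taken from the Aug 2000 list, so the result is not expected to reproduce the 1.8 figure.

```python
def query_length(query):
    """Count the words in a logged query, where '+' or spaces separate words."""
    return len(query.replace("+", " ").split())

def average_query_length(counts):
    """Mean words per query, weighted by how often each query occurred."""
    total_queries = sum(counts.values())
    total_words = sum(query_length(q) * n for q, n in counts.items())
    return total_words / total_queries

# Three (query, count) pairs from the Aug 2000 intranet list.
counts = {"bearfacts": 3351, "schedule+of+classes": 1874, "campus+map": 974}
print(round(average_query_length(counts), 2))
```

Weighting by count matters here: a plain average over the distinct query strings would treat a query issued 3351 times the same as one issued once.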
2004.08.31 - SLIDE 37IS 202 - Fall 2004
Queries as Zeitgeist
From: http://www.google.com/press/zeitgeist.html
2004.08.31 - SLIDE 38IS 202 - Fall 2004
IR Issues in the Course
• What metadata is collected
• How the indexes are created
• How queries are formed
• How documents are ranked
• How shortest paths are computed
• How the system is built
  – … among other things!
  – This is just an introduction! Much more on these issues in the first half of the course
2004.08.31 - SLIDE 39IS 202 - Fall 2004
IO Issues in the Course
• How do people categorize and represent information?
• What types of metadata are there and how do we construct and use them?
• How do we create ontologies for representing information, especially opaque data like photographs?
• What new uses and applications will metadata enable, especially for mobile media?
  – … among other things!
  – This is just an introduction! Much more on these issues in the second half of the course
2004.08.31 - SLIDE 40IS 202 - Fall 2004
Course Format
• Most classes will be lecture/discussion sessions
  – Lecture ~55 minutes
  – Discussion ~25 minutes
• For each class, students will prepare discussion questions for each reading and help lead discussion
• Active participation is essential to your learning
• Some classes will be working sessions
  – Phone Project Presentations
  – Final Review
• Some classes will be exams
  – Midterm Exam
  – Final Exam
2004.08.31 - SLIDE 42IS 202 - Fall 2004
Moore’s Law for Cameras
[Figure: cameras compared at two price points, $400 and $40]
2000: Kodak DC40, Nintendo GameBoy Camera
2002: Kodak DX4900, SiPix StyleCam Blink
2004.08.31 - SLIDE 44IS 202 - Fall 2004
Camera Phones as Platform
• Media capture (images, video, audio)
• Programmable processing using open standard operating systems, programming languages, and APIs
• Wireless networking
• Personal information management functions
• Rich user interaction modalities
• Time, location, and user contextual metadata
2004.08.31 - SLIDE 45IS 202 - Fall 2004
Camera Phones as Platform
• In the first half of 2003, more camera phones were sold worldwide than digital cameras
• By 2008, the average camera phone is predicted to have 5 megapixel resolution
• Last month Casio and Samsung introduced 3.2 megapixel camera phones with optical zoom and photo flash
• There are more cell phone users in China than people in the United States (300 million)
• For 90% of the world their “computer” is their cell phone
2004.08.31 - SLIDE 46IS 202 - Fall 2004
Phone Project Goals
• Experience the actual process of information organization and retrieval
  – Especially as regards mobile media metadata creation, sharing, and (re)use
• Work in small, focused teams performing a variety of tasks
  – Mobile image capture and sharing
  – Ontology creation
  – Image annotation
  – Mobile media application design
• Explore and design new applications for an emerging information organization and retrieval platform
• Develop an ongoing resource for SIMS (an annotated photo database) for
  – Internal research and teaching
  – External promotional and informational purposes
2004.08.31 - SLIDE 47IS 202 - Fall 2004
Phone Project Requirements
• Create engaging and useful application scenarios and photos
• Create a shared, reusable resource of annotated photos
  – All photos will be stored in one directory
  – Design your metadata
    • So that all photos would be accessible from all applications
    • Not only for the needs of your particular application, but also for the reusability of your photos and metadata
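
One way to read these requirements is as a single record format that separates shared fields from application-specific ones. The sketch below is purely illustrative: the `PhotoRecord` class, its field names, and the sample values are hypothetical and are not the schema used in the Phone Project.

```python
from dataclasses import dataclass, field

@dataclass
class PhotoRecord:
    """Hypothetical shared metadata record for one annotated photo."""
    filename: str      # file name within the single shared directory
    creator: str
    captured_at: str   # ISO 8601 timestamp
    location: str
    tags: set = field(default_factory=set)          # shared, application-neutral
    app_fields: dict = field(default_factory=dict)  # keyed by application name

def find_by_tag(records, tag):
    """Any application can query the shared tags, whoever created the photo."""
    return [r.filename for r in records if tag in r.tags]

# Hypothetical sample records from two different applications.
records = [
    PhotoRecord("img001.jpg", "team_a", "2004-10-05T14:30:00", "South Hall",
                tags={"building", "campus"}, app_fields={"tour_app": {"stop": 3}}),
    PhotoRecord("img002.jpg", "team_b", "2004-10-06T09:15:00", "Sproul Plaza",
                tags={"campus", "event"}),
]
print(find_by_tag(records, "campus"))
```

Keeping application-specific data in its own namespace means one team's fields never block another team from finding or reusing the same photo through the shared ones.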
2004.08.31 - SLIDE 48IS 202 - Fall 2004
Assignments and Exams
• Approximately 12 assignments
  – Most due within one week to ten days
  – In second half, most related to the Phone Project
  – Sometimes “checked”, sometimes graded
• Final exam (during finals week)
• Grading
  – Assignments: 60%
    • Not evenly weighted
  – Final: 25%
  – Class Participation: 15%
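
As a worked example of the weights above, a course grade is the weighted sum of the three components. The sketch assumes each component is already expressed on a 0 to 100 scale; because the assignments are not evenly weighted, the assignments score is taken as given rather than computed. The function name and the sample scores are hypothetical.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"assignments": 0.60, "final": 0.25, "participation": 0.15}

def course_grade(scores):
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical scores: 0.60*90 + 0.25*80 + 0.15*100
print(course_grade({"assignments": 90, "final": 80, "participation": 100}))
```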
2004.08.31 - SLIDE 50IS 202 - Fall 2004
Readings
• Course Reader Part I of II
  – Should be available today at Copy Central on Bancroft
• Textbooks
  – Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
  – The Organization of Information, 2nd Edition. Arlene G. Taylor, Libraries Unlimited, 1999
2004.08.31 - SLIDE 51IS 202 - Fall 2004
Recommended Course
• INFOSYS 290 / Section 16 XML Foundations
• Instructor: Bob Glushko
• Units: 1
• W 12:30-2, Th 3:30-5 (5 weeks only: Sept 8 - Oct 7), 110 South Hall
2004.08.31 - SLIDE 52IS 202 - Fall 2004
For Next Time (!)
• Readings
  – Borges, Dennett, and Reddy (in reader; Borges is also online via the class web site)
• On-Line Questionnaire
  – Information about you
  – Assignment 1 on “What is information, according to your background or area of expertise?”
  – Due this Thursday, Sept 2