
2004.08.31 - SLIDE 1 IS 202 - Fall 2004

Course Introduction

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2004

SIMS 202: Information Organization and Retrieval

Credits to Marti Hearst for some of the slides in this lecture

2004.08.31 - SLIDE 2 IS 202 - Fall 2004

Today

• Introductions

• Course Overview

• Administrivia

2004.08.31 - SLIDE 3 IS 202 - Fall 2004

Today

• Introductions

• Course Overview

• Administrivia

2004.08.31 - SLIDE 4 IS 202 - Fall 2004

IS202 Teaching Team

Professor Ray Larson

Professor Marc Davis

TA Allison Billings

TA Tran Tu

2004.08.31 - SLIDE 5 IS 202 - Fall 2004

Who Am I?

• Professor and Associate Dean at SIMS

• Here from the founding of SIMS, faculty member of the “previous school”

2004.08.31 - SLIDE 6 IS 202 - Fall 2004

What Do I Do?

• Research
  – Design, development and evaluation of information retrieval systems and digital libraries
  – Cheshire II and III
  – Bibliometrics of the WWW
  – Geographic information retrieval (GIR)
  – XML Retrieval
  – Applications of Grid computing to (large-scale) IR and Digital Libraries
  – Distributed search and retrieval

• Teaching
  – Information Retrieval
  – Database Management

2004.08.31 - SLIDE 7 IS 202 - Fall 2004

Who Am I?

• Assistant Professor at SIMS (School of Information Management and Systems)

• Background
  – 1980 – 1984 B.A. from Wesleyan University in the College of Letters
  – 1984 – 1987 M.A. from the University of Konstanz in Literary Theory and Philosophy
  – 1990 – 1995 Ph.D. from MIT Media Laboratory in Media Arts and Sciences
  – 1993 – 1998 Member of the Research Staff and Project Coordinator at Interval Research Corporation
  – 1999 – 2002 Chairman and CTO of Amova

2004.08.31 - SLIDE 8 IS 202 - Fall 2004

What Do I Do?

• Create technology and applications that will enable daily media consumers to become daily media producers

• Research and teaching in the theory, design, and development of digital media systems for creating and using media metadata to automate media production and reuse
  – Research
    • Director of Garage Cinema Research
    • Projects in Media Metadata, Active Capture, Adaptive Media, Mobile Media Metadata, and Social Uses of Personal Media
    • Executive Committee Member and Co-Founder of the Center for New Media
    • Affiliated Faculty Member of the Berkeley Institute of Design
  – Teaching
    • Multimedia Information
    • Digital Media Design Studio
    • Foundations of New Media

2004.08.31 - SLIDE 9 IS 202 - Fall 2004

Student Introductions

• Who are you?
  – Name
  – Undergrad degree
  – Special areas of expertise and interest

• Why are you here?
  – What you want to learn from the course

2004.08.31 - SLIDE 10 IS 202 - Fall 2004

Today

• Introductions

• Course Overview

• Administrivia

2004.08.31 - SLIDE 11 IS 202 - Fall 2004

Goals of the Course

• Learn about
  – Design, development, and use of information organization and retrieval systems
  – Practical and theoretical foundations of information organization and analysis
  – Evaluation of information access systems
  – Cognitive and user-centric considerations
  – Hands-on experience with information systems

2004.08.31 - SLIDE 12 IS 202 - Fall 2004

Two Main Themes

Information Organization and Design

Information Retrieval and the Search Process

2004.08.31 - SLIDE 13 IS 202 - Fall 2004

Information Organization and Retrieval

• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.

• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.

• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.

• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.

The Oxford English Dictionary, cf. Rowley

2004.08.31 - SLIDE 14 IS 202 - Fall 2004

(Approximate) Course Schedule

• Organization
  – Phone Project Introduction
  – Categorization
  – Knowledge Representation
  – Lexical Relations and WordNet
  – Metadata Introduction
  – Controlled Vocabularies Introduction
  – Facetted Classification
  – Thesaurus Design and Construction
  – Semantic Web
  – Multimedia Information Organization and Retrieval
  – Metadata for Media
  – Phone Project Presentations

• Retrieval
  – Overview
  – Introduction to the Search Process
  – Boolean Queries and Text Processing
  – Web Search Issues and Architecture
  – Statistical Properties of Text and Vector Representation
  – Probabilistic Ranking & Relevance Feedback
  – Evaluation
  – Interfaces for Information Retrieval
  – Database Design

2004.08.31 - SLIDE 15 IS 202 - Fall 2004

Information Properties

• Information can be communicated electronically
  – Broadcasting
  – Networking

• Information can be easily duplicated and shared
  – Problems of ownership
  – Problems of control

Adapted from ‘Silicon Dreams’ by Robert W. Lucky

2004.08.31 - SLIDE 16 IS 202 - Fall 2004

Information Hierarchy

Wisdom

Knowledge

Information

Data

2004.08.31 - SLIDE 17 IS 202 - Fall 2004

Information Hierarchy

• Data
  – The raw material of information

• Information
  – Data organized and presented by someone

• Knowledge
  – Information read, heard, or seen and understood

• Wisdom
  – Distilled and integrated knowledge and understanding

2004.08.31 - SLIDE 18 IS 202 - Fall 2004

Information

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

-- T.S. Eliot, “The Rock”

Where is the information we have lost in data?

2004.08.31 - SLIDE 19 IS 202 - Fall 2004

Information Life Cycle

[Figure: information life cycle diagram. Labels: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard. Activity stages: Authoring/Modifying; Organizing/Indexing; Storing/Retrieval; Distribution/Networking; Accessing/Filtering; Using/Creating.]

2004.08.31 - SLIDE 20 IS 202 - Fall 2004

Authoring/Modifying

• Converting data+information+knowledge to new information

• Creating information from observation, thought

• Editing and publication

• Gatekeeping

2004.08.31 - SLIDE 21 IS 202 - Fall 2004

Organizing/Indexing

• Collecting and integrating information

• Affects data, information, and metadata

• “Metadata” describes data and information
  – More on this later

• Organizing information
  – Types of organization?

• Indexing

2004.08.31 - SLIDE 22 IS 202 - Fall 2004

Storing/Retrieving

• Information storage
  – How and where is information stored?

• Retrieving information
  – How is information recovered from storage?
  – How do we find needed information?
  – Linked with accessing/filtering stage

2004.08.31 - SLIDE 23 IS 202 - Fall 2004

Distribution/Networking

• Transmission of information

– How is information transmitted?

• Networks vs. broadcast

2004.08.31 - SLIDE 24 IS 202 - Fall 2004

Accessing/Filtering

• Using the organization created in the Organizing/Indexing (O/I) stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)

2004.08.31 - SLIDE 25 IS 202 - Fall 2004

Using/Creating

• Using information

• Transformation of information to knowledge

• Knowledge to new data and new information

2004.08.31 - SLIDE 26 IS 202 - Fall 2004

Key Issues in This Course

• How to find the appropriate information resources for someone’s (or your own) needs
  – Retrieving

• How to describe information resources so that they can be used effectively by those who need them
  – Organizing

2004.08.31 - SLIDE 27 IS 202 - Fall 2004

Key Issues

[Figure: information life cycle diagram, repeated. Labels: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard. Activity stages: Authoring/Modifying; Organizing/Indexing; Storing/Retrieval; Distribution/Networking; Accessing/Filtering; Using/Creating.]

2004.08.31 - SLIDE 28 IS 202 - Fall 2004

(Approximate) Course Schedule

• Organization
  – Phone Project Introduction
  – Categorization
  – Knowledge Representation
  – Lexical Relations and WordNet
  – Metadata Introduction
  – Controlled Vocabularies Introduction
  – Facetted Classification
  – Thesaurus Design and Construction
  – Semantic Web
  – Multimedia Information Organization and Retrieval
  – Metadata for Media
  – Phone Project Presentations

• Retrieval
  – Overview
  – Introduction to the Search Process
  – Boolean Queries and Text Processing
  – Web Search Issues and Architecture
  – Statistical Properties of Text and Vector Representation
  – Probabilistic Ranking & Relevance Feedback
  – Evaluation
  – Interfaces for Information Retrieval
  – Database Design

2004.08.31 - SLIDE 29 IS 202 - Fall 2004

Web Search Questions

• What do people search for?

• How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?

• How can search engines be improved?

2004.08.31 - SLIDE 30 IS 202 - Fall 2004

What Do People Search for on the Web?

• Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys

2004.08.31 - SLIDE 31 IS 202 - Fall 2004

What Do People Search for on the Web?

• Topics
  – Genealogy/Public Figure: 12%
  – Computer related: 12%
  – Business: 12%
  – Entertainment: 8%
  – Medical: 8%
  – Politics & Government: 7%
  – News: 7%
  – Hobbies: 6%
  – General info/surfing: 6%
  – Science: 6%
  – Travel: 5%
  – Arts/education/shopping/images: 14%

• Something is missing…

2004.08.31 - SLIDE 32 IS 202 - Fall 2004

What Do People Search for on the Web?

50,000 queries from Excite, 1997

Most frequent terms:

• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
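
A ranking like the Excite frequency list can be produced by a simple tally over a query log. The following is a minimal sketch in Python, assuming the log is available as a list of raw query strings; the `top_queries` helper and the `sample_log` data are hypothetical illustrations, not part of the Excite study.

```python
from collections import Counter


def top_queries(queries, n=5):
    """Return the n most frequent queries as (query, count) pairs.

    Queries are trimmed and lowercased first, so "Yahoo" and "yahoo "
    are counted as the same query. Empty strings are ignored.
    """
    normalized = [q.strip().lower() for q in queries if q.strip()]
    return Counter(normalized).most_common(n)


# Hypothetical miniature log (the slide's sample was 50,000 queries).
sample_log = ["yahoo", "chat", "Yahoo", "weather", "chat", "yahoo ", "maps"]

for query, count in top_queries(sample_log, n=2):
    print(count, query)
```

Note that this counts whole queries. Counting individual terms instead would require splitting each query on whitespace before tallying.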

2004.08.31 - SLIDE 33 IS 202 - Fall 2004

Why Do These Differ?

• Self-reporting survey

• The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    • UFO, flying saucer, space ship, satellite
    • How many ways are there to talk about history?

2004.08.31 - SLIDE 34 IS 202 - Fall 2004

What is on the Web?

• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information

Source: http://elib.cs.berkeley.edu/docfreq/index.html

2004.08.31 - SLIDE 35 IS 202 - Fall 2004

Intranet Queries (Aug 2000)

• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid

2004.08.31 - SLIDE 36 IS 202 - Fall 2004

Intranet Queries

• Summary of sample data from 3 weeks of UCB queries
  – 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  – 6.7% Schedule of classes or final exams (6222)
  – 5.4% Summer Session (5041)
  – 3.2% Extension (2932)
  – 3.1% Academic Calendar (2846)
  – 2.4% Directories (2202)
  – 1.7% Career Center (1588)
  – 1.7% Housing (1583)
  – 1.5% Map (1393)

• Average query length over last 4 months: 1.8 words

• This suggests what is difficult to find from the home page
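
A figure such as the 1.8-word average query length can be computed directly from a query log. The following is a minimal sketch in Python, assuming queries are logged with `+` standing in for spaces, as in the Aug 2000 intranet list; the `average_query_length` helper and the `sample` list are hypothetical, and this version averages over the queries given rather than weighting each by how often it was issued.

```python
def average_query_length(queries):
    """Mean number of words per query.

    A '+' is treated as a space, matching URL-encoded query logs.
    Returns 0.0 for an empty log.
    """
    lengths = [len(q.replace("+", " ").split()) for q in queries]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical sample drawn from the query strings on the earlier slide.
sample = ["bearfacts", "schedule+of+classes", "career+center", "map"]
print(average_query_length(sample))
```

To reproduce a log-wide average, each query would need to appear once per occurrence in the input, or the lengths would need to be weighted by the query counts.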

2004.08.31 - SLIDE 37 IS 202 - Fall 2004

Queries as Zeitgeist

From: http://www.google.com/press/zeitgeist.html

2004.08.31 - SLIDE 38 IS 202 - Fall 2004

IR Issues in the Course

• What metadata is collected

• How the indexes are created

• How queries are formed

• How documents are ranked

• How shortest paths are computed

• How the system is built
  – … among other things!
  – This is just an introduction! Much more on these issues in the first half of the course

2004.08.31 - SLIDE 39 IS 202 - Fall 2004

IO Issues in the Course

• How do people categorize and represent information?

• What types of metadata are there and how do we construct and use them?

• How do we create ontologies for representing information, especially opaque data like photographs?

• What new uses and applications will metadata enable, especially for mobile media?
  – … among other things!
  – This is just an introduction! Much more on these issues in the second half of the course

2004.08.31 - SLIDE 40 IS 202 - Fall 2004

Course Format

• Most classes will be lecture/discussion sessions
  – Lecture ~55 minutes
  – Discussion ~25 minutes

• For each class students will prepare discussion questions for each reading and help lead discussion

• Active participation is essential to your learning

• Some classes will be working sessions
  – Phone Project Presentations
  – Final Review

• Some classes will be exams
  – Midterm Exam
  – Final Exam

2004.08.31 - SLIDE 41 IS 202 - Fall 2004

IS202 Course Project

2004.08.31 - SLIDE 42 IS 202 - Fall 2004

Moore’s Law for Cameras

[Figure: camera comparison. Labels in the order they appear: 2000; Kodak DC40; Nintendo GameBoy Camera; $400; $40; 2002; Kodak DX4900; SiPix StyleCam Blink.]

2004.08.31 - SLIDE 43 IS 202 - Fall 2004

Capture+Processing+Interaction+Network

2004.08.31 - SLIDE 44 IS 202 - Fall 2004

Camera Phones as Platform

• Media capture (images, video, audio)

• Programmable processing using open standard operating systems, programming languages, and APIs

• Wireless networking

• Personal information management functions

• Rich user interaction modalities

• Time, location, and user contextual metadata

2004.08.31 - SLIDE 45 IS 202 - Fall 2004

Camera Phones as Platform

• In the first half of 2003, more camera phones were sold worldwide than digital cameras

• By 2008, the average camera phone is predicted to have 5 megapixel resolution

• Last month Casio and Samsung introduced 3.2 megapixel camera phones with optical zoom and photo flash

• There are more cell phone users in China than people in the United States (300 million)

• For 90% of the world their “computer” is their cell phone

2004.08.31 - SLIDE 46 IS 202 - Fall 2004

Phone Project Goals

• Experience the actual process of information organization and retrieval
  – Especially as regards mobile media metadata creation, sharing, and (re)use

• Work in small, focused teams performing a variety of tasks
  – Mobile image capture and sharing
  – Ontology creation
  – Image annotation
  – Mobile media application design

• Explore and design new applications for an emerging information organization and retrieval platform

• Develop an ongoing resource for SIMS (an annotated photo database) for
  – Internal research and teaching
  – External promotional and informational purposes

2004.08.31 - SLIDE 47 IS 202 - Fall 2004

Phone Project Requirements

• Create engaging and useful application scenarios and photos

• Create a shared, reusable resource of annotated photos
  – All photos will be stored in one directory
  – Design your metadata
    • So that all photos would be accessible from all applications
    • Not only for the needs of your particular application, but also for the reusability of your photos and metadata

2004.08.31 - SLIDE 48 IS 202 - Fall 2004

Assignments and Exams

• Approximately 12 assignments
  – Most due within one week to ten days
  – In second half, most related to the Phone Project
  – Sometimes “checked”, sometimes graded

• Final exam (during finals week)

• Grading
  – Assignments: 60%
    • Not evenly weighted
  – Final: 25%
  – Class Participation: 15%
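
The grading split can be read as a weighted sum of three component scores. The following is a minimal sketch in Python of that arithmetic, assuming each component is already expressed as a percentage from 0 to 100. The `course_score` helper and the example scores are hypothetical, and because the assignments are not evenly weighted, the single assignments percentage here stands in for however those per-assignment weights are combined.

```python
# Component weights as stated on the slide: 60% / 25% / 15%.
WEIGHTS = {"assignments": 0.60, "final": 0.25, "participation": 0.15}


def course_score(scores):
    """Weighted course score from per-component percentages (0 to 100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: 90% on assignments, 80% on the final, full participation.
example = {"assignments": 90, "final": 80, "participation": 100}
print(round(course_score(example), 1))
```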

2004.08.31 - SLIDE 49 IS 202 - Fall 2004

Today

• Introductions

• Course Overview

• Administrivia

2004.08.31 - SLIDE 50 IS 202 - Fall 2004

Readings

• Course Reader Part I of II
  – Should be available today at Copy Central on Bancroft

• Textbooks
  – Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
  – The Organization of Information, 2nd Edition. Arlene G. Taylor, Libraries Unlimited, 1999

2004.08.31 - SLIDE 51 IS 202 - Fall 2004

Recommended Course

• INFOSYS 290 / Section 16 XML Foundations

• Instructor: Bob Glushko

• Units: 1

• W 12:30-2, Th 3:30-5 (5 weeks only: Sept 8 - Oct 7), 110 South Hall

2004.08.31 - SLIDE 52 IS 202 - Fall 2004

For Next Time (!)

• Readings
  – Borges, Dennett, and Reddy (in reader; Borges is also online via the class web site)

• On-Line Questionnaire
  – Information about you
  – Assignment 1 on “What is information, according to your background or area of expertise?”
  – Due this Thursday, Sept 2

2004.08.31 - SLIDE 53 IS 202 - Fall 2004

Next Time

• More on what is information?

• And how much of it is out there?

• Discussion Questions for:
  – Borges?
  – Dennett?
  – Reddy?