cs707_010312

8/3/2019 cs707_010312

1/23

Prasad L1IntroIR 1

Information Retrieval

Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Unstructured (text) vs. structured (database) data in 1996

[Bar chart: unstructured vs. structured data compared on Data volume and Market Cap; axis scale 0 to 160]

Unstructured (text) vs. structured (database) data in 2006

[Bar chart: unstructured vs. structured data compared on Data volume and Market Cap; axis scale 0 to 160]

Structured vs unstructured data

Structured data: information in tables

    Employee   Manager   Salary
    Smith      Jones     50000
    Chang      Smith     60000
    Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
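As a minimal sketch, the example query can be run against the slide's table held as a list of dictionaries. The function name `select` and the field names are illustrative assumptions, not part of the lecture.

```python
# Rows of the slide's employee table as plain dictionaries.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, max_salary, manager):
    """Return rows where salary < max_salary AND manager matches exactly."""
    return [
        row for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


matches = select(employees, 60000, "Smith")
print([row["employee"] for row in matches])  # ['Ivy']
```

Chang is excluded because the range condition is strict (60000 is not less than 60000), and Smith is excluded because the manager field must match exactly.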

Unstructured data

Typically refers to free text
  - Data which does not have clear, semantically overt, easy-for-a-computer structure

Allows
  - Keyword-based queries including operators
  - More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
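A keyword query with operators can be sketched with set operations over a toy collection. The three documents and the helper `containing` are made up for illustration.

```python
# Toy collection; the texts are invented for this sketch.
docs = {
    1: "drug abuse among teenagers is rising",
    2: "new drug approved for heart disease",
    3: "reports of alcohol abuse and drug abuse in cities",
}


def containing(term):
    """Set of document ids whose text contains the term as a whole word."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


both = containing("drug") & containing("abuse")            # drug AND abuse
either = containing("alcohol") | containing("heart")       # alcohol OR heart
drug_not_abuse = containing("drug") - containing("abuse")  # drug AND NOT abuse

print(sorted(both))            # [1, 3]
print(sorted(either))          # [2, 3]
print(sorted(drug_not_abuse))  # [2]
```

A concept query is harder: a page about narcotics addiction deals with drug abuse but contains neither keyword, so pure term matching would miss it.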

Semi-structured data

In fact almost no data is unstructured
  - E.g., this slide has distinctly identified zones such as the Title and Bullets

Facilitates semi-structured search such as
  - Title contains data AND Bullets contain search

... to say nothing of linguistic structure
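The zone query above can be sketched by treating each slide as a record with a title zone and a bullets zone. The three sample records and the helper names are assumptions for illustration.

```python
import re

# Each slide is a record with two zones; contents are abbreviated samples.
slides = [
    {"title": "Structured vs unstructured data",
     "bullets": ["Structured data: information in tables"]},
    {"title": "Semi-structured data",
     "bullets": ["Facilitates semi-structured search"]},
    {"title": "What is IR?",
     "bullets": ["Access to information items"]},
]


def words(text):
    """Lowercased alphabetic tokens of a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def zone_search(records, title_term, bullet_term):
    """Titles of records where the title zone has one term and the bullets zone the other."""
    hits = []
    for record in records:
        bullet_words = set()
        for bullet in record["bullets"]:
            bullet_words |= words(bullet)
        if title_term in words(record["title"]) and bullet_term in bullet_words:
            hits.append(record["title"])
    return hits


print(zone_search(slides, "data", "search"))  # ['Semi-structured data']
```

The first record has "data" in its title but no "search" in its bullets, so the AND across zones rejects it.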

What is IR?

Representation
  - Keywords/Phrases, Structure/Fonts, Counts, etc.

Organization and Storage
  - Inverted File Index, Compressed, etc.
  - Hardware Architecture and Memory Hierarchy

Access to information items
  - Interface: Spell-checker to tree-structured display
  - Visualization: Labeled Clusters, Timelines, Spring graphs, etc.
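The inverted file index mentioned under Organization and Storage maps each term to the documents that contain it. A minimal uncompressed sketch follows; the sample documents and the function name `build_inverted_index` are assumptions.

```python
from collections import defaultdict

# Toy collection; the texts are invented for this sketch.
docs = {
    1: "information retrieval is about information needs",
    2: "data retrieval requires exact match",
    3: "an inverted file speeds up retrieval",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() records each document once per term, however often the term repeats.
        for term in set(documents[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(index["retrieval"])    # [1, 2, 3]
print(index["information"])  # [1]
```

Looking up a term is then a dictionary access rather than a scan of every document, which is the point of building the index ahead of query time.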

Ultimate Focus of IR

Satisfying user information need
  - Emphasis is on retrieval of information (not data)

User information need: Examples
  - Printer reviews; Printer prices and availability
  - Words in which all vowels appear
  - Anagram/Permutations of art
  - Flight numbers; UPS/FedEx/USPS Tracking code

Predicting which documents are relevant, and then linearly ranking them.

Information Need: Query, Relevancy

An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.

A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.

DIKW Hierarchy

Data: Symbolic units
  - E.g., Records of customers.
  - E.g., Bytes from sensors.

Information: Data with an interpretation (Who?, What?, When?, Where?).
  - E.g., Records of current/new customers grouped by their ages.
  - E.g., Variation in temperature readings.

DIKW Hierarchy

Knowledge: Information organized with theoretical concepts or abstract ideas (How?)
  - E.g., How many customers have cancelled their accounts in the current fiscal year?
  - E.g., Analysis of temperature variation over the years and its causes.

Wisdom: Understanding of fundamental principles + Human Judgement
  - E.g., What strategies can be employed to retain customers in the face of cheaper alternatives?
  - E.g., Global warming issues and the future of Earth.

DIKW hierarchy: Clark 2004

[Figure: Data, Information, Knowledge and Wisdom plotted against two axes, Context and Understanding. Context labels: Gathering of parts, Connection of parts, Formation of a whole, Joining of wholes. Understanding labels: Researching, Absorbing, Doing, Interacting, Reflecting. Further labels: Past, Future, Experience, Novelty.]

You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"

George Bernard Shaw

Information vs Data Retrieval

                     Information retrieval                                       Data retrieval
DATA:                Unstructured: open to interpretation                        Structured with well-defined semantics
QUERY:               Usually incomplete or ambiguous (w.r.t. information need)   Well-defined semantics
QUALITY OF RESULTS:  Partial match allowed, relevance-based ranking              Exact match required - no or many results
FOUNDATIONS:         Probabilistic underpinnings                                 Algebra/Logic
APPLICATION:         Library                                                     Accounting
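The contrast in result quality can be sketched on a toy collection. The scoring rule (count of query terms present) is a deliberately simple stand-in chosen for this sketch, not the probabilistic model the table refers to; the documents and function names are also assumptions.

```python
# Toy collection; the texts are invented for this sketch.
docs = {
    "d1": "cheap color laser printer reviews",
    "d2": "printer prices and availability",
    "d3": "flight numbers and tracking codes",
}


def exact_match(query, documents):
    """Data-retrieval style: keep only documents containing every query term."""
    terms = set(query.split())
    return [d for d, text in documents.items() if terms <= set(text.split())]


def ranked_match(query, documents):
    """IR style: score by number of query terms present, best first."""
    terms = set(query.split())
    scores = {d: len(terms & set(text.split())) for d, text in documents.items()}
    hits = [(d, score) for d, score in scores.items() if score > 0]
    # Sort by descending score, breaking ties by document id.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


query = "laser printer reviews prices"
print(exact_match(query, docs))   # []
print(ranked_match(query, docs))  # [('d1', 3), ('d2', 2)]
```

No document contains all four terms, so exact matching returns nothing, while partial matching still returns the two useful documents in ranked order.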

User Task

  - Retrieval
      Purposeful: HP Multifunction Printer Information
  - Browsing
      Casual: Big Bang, CBR, Element Genesis, Supernova, ...
      Hyperlink-based
  - Filtering by Agents
      Push: Podcasts from B.B.C.'s Naked Science

[Figure: user tasks, Retrieval and Browsing, over a Database]

Logical View of Documents

Abstraction (essentials)
  - Structure, fonts, proximity, repetitions, etc.

[Figure: text operations applied to Docs: structure; accents, spacing; stopwords; noun groups; stemming; manual indexing. Resulting logical views: structure, full text, index terms.]
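A few of the text operations named in the figure can be sketched as a small pipeline from full text to index terms. The stopword list is a tiny illustrative sample, and `naive_stem` is a crude suffix stripper assumed for this sketch, not Porter's algorithm or any stemmer the lecture prescribes.

```python
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> accent-free, lowercased, stopword-free, stemmed terms."""
    tokens = strip_accents(text).lower().split()
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]


print(index_terms("The résumé of indexing and searching documents"))
# ['resume', 'index', 'search', 'document']
```

Each step discards detail, which is the trade-off the slide points at: the index terms are a more compact logical view than the full text, but a less faithful one.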

The Retrieval Process

[Figure: block diagram of the retrieval process. Blocks: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flows: user need, user feedback, query, logical view, inverted file, retrieved docs, ranked docs, Text. The blocks carry numeric labels (4, 10; 6, 7; 5; 8; 2; 8).]

IR Basics

Models and retrieval evaluation

Query languages and operations
  - Improve inferring query context (query expansion, relevance feedback)

Text operations
  - Improve gleaning of document semantics (stemming keywords)

Efficient Access: Index and Search
  - Visualization, Multimedia, Applications
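Query expansion, one of the query operations listed above, can be sketched as adding related terms to the user's query. The synonym table is hypothetical; a real system might derive it from a thesaurus or from co-occurrence statistics.

```python
# Hypothetical synonym table, invented for this sketch.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}


def expand_query(query):
    """Return the query terms, each followed by its synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("cheap car rental"))
# ['cheap', 'inexpensive', 'car', 'automobile', 'vehicle', 'rental']
```

The expanded query can match documents that express the same need in different words, at the cost of possibly matching more irrelevant ones.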

Clustering and classification

Given a set of docs, group them into clusters based on their content.

Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
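The classification task can be sketched with keyword overlap: a document is assigned to every topic with which it shares enough terms. The topic keyword sets and the `min_overlap` threshold are assumptions for illustration; practical classifiers learn such evidence from labeled examples.

```python
# Hypothetical topic keyword sets, invented for this sketch.
TOPICS = {
    "sports": {"match", "team", "score", "league"},
    "finance": {"market", "stock", "price", "bank"},
    "science": {"experiment", "theory", "data", "energy"},
}


def classify(doc, topics, min_overlap=1):
    """Return, in alphabetical order, the topics sharing at least min_overlap words with doc."""
    doc_words = set(doc.lower().split())
    scores = {name: len(doc_words & keywords) for name, keywords in topics.items()}
    return sorted(name for name, score in scores.items() if score >= min_overlap)


new_doc = "Stock price falls as bank reports weak market data"
print(classify(new_doc, TOPICS))                 # ['finance', 'science']
print(classify(new_doc, TOPICS, min_overlap=2))  # ['finance']
```

Raising the threshold drops the weak "science" assignment, which rested on the single shared word "data".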

The web and its challenges

Unusual and diverse documents

Unusual and diverse users, queries, information needs

Beyond terms, exploit ideas from social networks
  - link analysis, clickstreams, ...

How do search engines work? And how can we make them better?
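As one concrete instance of link analysis, here is a small power-iteration sketch of PageRank. The slide does not name a specific algorithm, so the choice of PageRank, the damping value 0.85, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
print(best)  # C
```

Page C ranks highest because three pages link to it, and A comes second because it receives all of C's rank; the score depends on the link structure rather than on the terms a page contains.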

More sophisticated semi-structured search

Title is about Object Oriented Programming AND Author something like stro*rup
  - where * is the wild-card operator

Issues:
  - how do you process "about"?
  - how do you rank results?

The focus of XML search.
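The wild-card part of the query can be sketched with Python's standard `fnmatch` module, whose `*` matches any run of characters. The author list is invented for illustration, and the "about" condition on the title is not addressed here.

```python
from fnmatch import fnmatchcase

# Invented author names for this sketch.
authors = ["Stroustrup", "Strostrup", "Strong", "Armstrong"]


def wildcard_match(pattern, names):
    """Names matching the pattern, ignoring case; * matches any run of characters."""
    pattern = pattern.lower()
    return [name for name in names if fnmatchcase(name.lower(), pattern)]


print(wildcard_match("stro*rup", authors))  # ['Stroustrup', 'Strostrup']
```

Note that `fnmatch` also gives special meaning to `?` and `[...]`, so a pattern containing those characters would need escaping. Scanning every name is fine for a short list; over a large vocabulary an index would be needed to avoid checking each term.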

More sophisticated information retrieval

Cross-language information retrieval
Question answering
Summarization
Text mining

Future Progress: Factors/Trends

Large, uncontrolled publishing media
  - Quality issues

Cheap, fast and wide access
  - Ease of use (query formulation)

Variety and flexibility
  - Navigational and Visualization aids
  - Directory-based (Table of contents) vs Keywords-based (Inverted File Index)

Index terms (automatic/human-created) vs Full-text

Privacy, Security, Copyright
