Post on 06-Apr-2018
8/3/2019 cs707_010312
1/23
Prasad L1IntroIR 1
Information Retrieval
Adapted from Lectures by
Berthier Ribeiro-Neto (Brazil),
Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Unstructured (text) vs. structured (database) data in 1996
[Figure: bar chart comparing Unstructured and Structured data by Data volume and Market Cap; vertical axis 0 to 160]
Unstructured (text) vs. structured (database) data in 2006
[Figure: bar chart comparing Unstructured and Structured data by Data volume and Market Cap; vertical axis 0 to 160]
Structured vs unstructured data
Structured data: information in tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
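The example query can be sketched in a few lines of Python. This is only an illustration, assuming the three rows of the table; the names `Employee`, `EMPLOYEES` and `query` are made up here and are not from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    manager: str
    salary: int


# Rows copied from the slide's table.
EMPLOYEES = [
    Employee("Smith", "Jones", 50000),
    Employee("Chang", "Smith", 60000),
    Employee("Ivy", "Smith", 50000),
]


def query(rows, max_salary, manager):
    """Salary < max_salary (numerical range) AND Manager = manager (exact text match)."""
    return [r.name for r in rows if r.salary < max_salary and r.manager == manager]


print(query(EMPLOYEES, 60000, "Smith"))  # ['Ivy']
```

Every row either satisfies the condition or it does not; there is no notion of a partial or "close" answer.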
Unstructured data
Typically refers to free text
  - Data which does not have clear, semantically overt, easy-for-a-computer structure
Allows
  - Keyword-based queries including operators
  - More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
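A keyword query with operators might look like the sketch below. The page texts and the helper names (`tokens`, `matches`, `PAGES`) are hypothetical and serve only to show AND/OR/NOT over free text.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean keyword match: AND over all_of, OR over any_of, NOT over none_of."""
    words = tokens(text)
    return (
        all(w in words for w in all_of)
        and (not any_of or any(w in words for w in any_of))
        and not any(w in words for w in none_of)
    )


# Hypothetical page texts.
PAGES = {
    "p1": "Drug abuse among teenagers is rising",
    "p2": "New drug approved for heart disease",
    "p3": "Substance misuse and addiction counselling",
}

hits = [pid for pid, text in PAGES.items() if matches(text, all_of=("drug", "abuse"))]
print(hits)  # ['p1']
```

Page p3 deals with the same concept but shares no keyword with the query, which is why concept queries need more than keyword matching.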
Semi-structured data
In fact almost no data is unstructured
  - E.g., this slide has distinctly identified zones such as the Title and Bullets
Facilitates semi-structured search such as
  - Title contains data AND Bullets contain search
To say nothing of linguistic structure
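The zone query above can be sketched as follows, assuming each slide is stored as a small record with a `title` zone and a `bullets` zone. The records and function names are illustrative assumptions.

```python
import re


def contains(zone_text, word):
    """True if `word` occurs as a whole token in the zone's text (case-insensitive)."""
    return word.lower() in re.findall(r"[a-z]+", zone_text.lower())


# Hypothetical slides, each split into two zones.
SLIDES = [
    {"title": "Semi-structured data", "bullets": "Facilitates semi-structured search"},
    {"title": "Unstructured data", "bullets": "Typically refers to free text"},
    {"title": "What is IR?", "bullets": "Access to information items; search interface"},
]


def zone_query(slides):
    """Title contains 'data' AND Bullets contain 'search'."""
    return [
        s["title"]
        for s in slides
        if contains(s["title"], "data") and contains(s["bullets"], "search")
    ]


print(zone_query(SLIDES))  # ['Semi-structured data']
```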
What is IR?
Representation
  - Keywords/Phrases, Structure/Fonts, Counts, etc.
Organization and Storage
  - Inverted File Index, Compressed, etc.
  - Hardware Architecture and Memory Hierarchy
Access to information items
  - Interface: Spell-checker to tree-structured display
  - Visualization: Labeled Clusters, Timelines, Spring graphs, etc.
Ultimate Focus of IR
Satisfying user information need
  - Emphasis is on retrieval of information (not data)
User information need: Examples
  - Printer reviews; Printer prices and availability
  - Words in which all vowels appear
  - Anagram/Permutations of art
  - Flight numbers; UPS/FedEx/USPS Tracking code
Predicting which documents are relevant, and then linearly ranking them.
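Two of the example needs (words containing all vowels, anagrams of "art") are mechanical enough to sketch directly. The vocabulary below is a made-up stand-in for a real word list.

```python
VOWELS = set("aeiou")


def has_all_vowels(word):
    """True if every one of a, e, i, o, u occurs in the word."""
    return VOWELS <= set(word.lower())


def anagrams(word, vocabulary):
    """Vocabulary words that use exactly the same letters as `word`."""
    key = sorted(word.lower())
    return [w for w in vocabulary if sorted(w.lower()) == key]


# Hypothetical vocabulary.
VOCAB = ["art", "rat", "tar", "cart", "education", "sequoia", "printer"]

print([w for w in VOCAB if has_all_vowels(w)])  # ['education', 'sequoia']
print(anagrams("art", VOCAB))                   # ['art', 'rat', 'tar']
```

Needs such as "printer reviews" have no such crisp test, which is where relevance prediction and ranking come in.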
Information Need: Query, Relevancy
An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.
A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.
DIKW Hierarchy
Data: Symbolic units
  - E.g., Records of customers.
  - E.g., Bytes from sensors.
Information: Data with an interpretation (Who?, What?, When?, Where?).
  - E.g., Records of current/new customers grouped by their ages.
  - E.g., Variation in temperature readings.
DIKW Hierarchy
Knowledge: Information organized with theoretical concepts or abstract ideas (How?)
  - E.g., How many customers have cancelled their accounts in the current fiscal year?
  - E.g., Analysis of temperature variation over the years and its causes.
Wisdom: Understanding of fundamental principles + Human Judgement
  - E.g., What strategies can be employed to retain customers in the face of cheaper alternatives?
  - E.g., Global warming issues and the future of Earth.
DIKW hierarchy: Clark 2004
[Figure: Data, Information, Knowledge and Wisdom placed on axes labeled Understanding and Context. Other labels: Researching, Absorbing, Doing, Interacting, Reflecting; Gathering of parts, Connection of parts, Formation of a whole, Joining of wholes; Past, Future; Experience, Novelty]
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
George Bernard Shaw
Information vs Data Retrieval
                    | Information Retrieval | Data Retrieval
DATA:               | Unstructured: open to interpretation | Structured with well-defined semantics
QUERY:              | Usually incomplete or ambiguous (w.r.t. information need) | Well-defined semantics
QUALITY OF RESULTS: | Partial match allowed, relevance-based ranking | Exact match required - no or many results
FOUNDATIONS:        | Probabilistic underpinnings | Algebra/Logic
APPLICATION:        | Library | Accounting
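The difference in result quality can be sketched as below. The records, documents and function names are hypothetical, and the score (fraction of query terms present) is a deliberately simple stand-in for a real relevance model, not a probabilistic one.

```python
def exact_match(records, field, value):
    """Data retrieval: a record either satisfies the condition or it does not."""
    return [r for r in records if r[field] == value]


def ranked_match(docs, query):
    """Information retrieval: partial matches allowed, ordered by a relevance score."""
    q = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q & set(text.lower().split()))
        if overlap:
            scored.append((overlap / len(q), doc_id))
    # Highest score first; ties broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical data.
RECORDS = [{"account": "A-17", "balance": 250}, {"account": "B-02", "balance": 980}]
DOCS = {
    "d1": "laser printer reviews",
    "d2": "printer prices and availability",
    "d3": "flight numbers",
}

print(exact_match(RECORDS, "account", "A-1"))   # []
print(ranked_match(DOCS, "printer reviews"))    # ['d1', 'd2']
```

A slightly wrong account number returns nothing, while a document matching only one of two query terms is still returned, just ranked lower.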
User Task
  - Retrieval: Purposeful
      HP Multifunction Printer Information
  - Browsing: Casual
      Big Bang, CBR, Element Genesis, Supernova, ...
      Hyperlink-based
  - Filtering by Agents: Push
      Podcasts from B.B.C.'s Naked Science
[Figure: user tasks of Retrieval and Browsing over a Database]
Logical View of Documents
Abstraction (essentials)
  - Structure, fonts, proximity, repetitions, etc.
[Figure: text operations applied to Docs: structure, accents and spacing, stopwords, noun groups, stemming, manual indexing. Resulting logical views: structure, full text, index terms]
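A few of the text operations named in the figure (accents, stopwords, stemming) can be sketched as a small pipeline from full text to index terms. The stopword list is a tiny illustrative sample, and `crude_stem` is a toy suffix stripper assumed here for brevity, not the Porter stemmer or any standard algorithm.

```python
import re
import unicodedata

STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}  # tiny illustrative list


def strip_accents(text):
    """Decompose characters and drop combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> accent-free lowercase tokens -> stopwords removed -> stems."""
    words = re.findall(r"[a-z]+", strip_accents(text).lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("The Résumés of the indexing engines"))  # ['resum', 'index', 'engin']
```

Each step discards detail, so the logical view moves from the full text toward a compact set of index terms.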
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flows between components: user need, text, logical view, query, inverted file, retrieved docs, ranked docs, user feedback]
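The indexing, searching and ranking stages can be sketched end to end on a toy collection. The documents are hypothetical, and ranking by summed term frequency is an assumption chosen for simplicity rather than the model used in the course.

```python
import re
from collections import Counter, defaultdict


def terms(text):
    """Text operation: lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency} (a simple inverted file)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(terms(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, query):
    """Searching and ranking: docs containing any query term, best score first."""
    scores = Counter()
    for term in terms(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical text database.
DOCS = {
    "d1": "printer reviews and printer prices",
    "d2": "flight numbers and tracking codes",
    "d3": "multifunction printer information",
}

index = build_index(DOCS)
print(search(index, "printer prices"))  # ['d1', 'd3']
```

The query goes through the same text operation as the documents, so both sides share one logical view.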
IR Basics
Models and retrieval evaluation
Query languages and operations
  - Improve inferring query context (query expansion, relevance feedback)
Text operations
  - Improve gleaning of document semantics (stemming keywords)
Efficient Access: Index and Search
Visualization, Multimedia, Applications
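Query expansion, one of the query operations listed above, can be sketched with a lookup table. The synonym table is a hand-made, hypothetical example; how expansions are actually obtained is outside this sketch.

```python
# Hypothetical hand-made synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}


def expand(query_terms, synonyms=SYNONYMS):
    """Return the query terms plus their listed synonyms, without duplicates."""
    expanded = []
    for term in query_terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand(["cheap", "car", "rental"]))
# ['cheap', 'inexpensive', 'car', 'automobile', 'vehicle', 'rental']
```

The expanded query can match documents that express the same need in different words.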
Clustering and classification
Given a set of docs, group them into clusters based on their content.
Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
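The classification task can be sketched by comparing a new doc D against one bag of words per topic using cosine similarity. The topic vocabularies are hypothetical, and picking the single most similar topic is one simple choice among many; clustering is not covered by this sketch.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical topics, each summarized by one bag of words.
TOPICS = {
    "printers": vector("printer ink toner printer scanner"),
    "astronomy": vector("supernova big bang star element genesis"),
}


def classify(doc, topics=TOPICS):
    """Return the topic whose vector is most similar to the doc."""
    d = vector(doc)
    return max(sorted(topics), key=lambda name: cosine(d, topics[name]))


print(classify("review of a laser printer and scanner"))  # printers
```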
The web and its challenges
Unusual and diverse documents
Unusual and diverse users, queries, information needs
Beyond terms, exploit ideas from social networks
  - link analysis, clickstreams, ...
How do search engines work? And how can we make them better?
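As one concrete instance of link analysis, the sketch below runs a PageRank-style power iteration on a tiny link graph. The slides do not name a specific algorithm, so PageRank is a choice made here for illustration; the graph, the damping value 0.85 and the iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))  # c
```

The score of a page depends on who links to it rather than on the terms it contains.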
More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup
  - where * is the wild-card operator
Issues:
  - how do you process about?
  - how do you rank results?
The focus of XML search.
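The wild-card part of the query can be sketched with Python's standard `fnmatch` module, whose `*` matches any run of characters. The author list is hypothetical, and the harder parts of the query (processing "about", ranking) are not addressed here.

```python
from fnmatch import fnmatchcase

# Hypothetical author names.
AUTHORS = ["Stroustrup", "Strostrup", "Strong", "Kernighan"]


def wildcard_match(pattern, names):
    """Case-insensitive match of names against a pattern where * is a wild card."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


print(wildcard_match("stro*rup", AUTHORS))  # ['Stroustrup', 'Strostrup']
```

Note that `fnmatch` also gives special meaning to `?` and `[...]`, so a pattern containing those characters would need escaping.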
More sophisticated information retrieval
Cross-language information retrieval
Question answering
Summarization
Text mining
Future Progress: Factors/Trends
Large, uncontrolled publishing media
  - Quality issues
Cheap, fast and wide access
  - Ease of use (query formulation)
Variety and flexibility
  - Navigational and Visualization aids
  - Directory-based (Table of contents) vs Keywords-based (Inverted File Index)
Index terms (automatic/human-created) vs Full-text
Privacy, Security, Copyright