cs707_010312

  • 8/3/2019 cs707_010312

    1/23

    Prasad L1IntroIR 1

    Information Retrieval

    Adapted from Lectures by

    Berthier Ribeiro-Neto (Brazil),

    Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

  • 8/3/2019 cs707_010312

    Unstructured (text) vs. structured (database) data in 1996

    [Bar chart: "Data volume" and "Market Cap" for unstructured vs. structured data; vertical axis 0 to 160]

  • 8/3/2019 cs707_010312

    Unstructured (text) vs. structured (database) data in 2006

    [Bar chart: "Data volume" and "Market Cap" for unstructured vs. structured data; vertical axis 0 to 160]

  • 8/3/2019 cs707_010312

    Structured vs unstructured data

    Structured data: information in tables

    Employee   Manager   Salary
    Smith      Jones     50000
    Chang      Smith     60000
    Ivy        Smith     50000

    Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
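
The example query on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list-of-dicts layout and the `select` helper are hypothetical stand-ins for a real database and query language.

```python
# Toy copy of the slide's employee table as a list of records.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which the predicate holds (all-or-nothing match)."""
    return [row for row in rows if predicate(row)]


# Salary < 60000 AND Manager = Smith
matches = select(
    employees,
    lambda r: r["salary"] < 60000 and r["manager"] == "Smith",
)
print([r["employee"] for r in matches])  # prints ['Ivy']
```

Each row either satisfies the condition or it does not; there is no notion of a "better" or "worse" match.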

  • 8/3/2019 cs707_010312

    Unstructured data

    Typically refers to free text
      - Data which does not have clear, semantically overt, easy-for-a-computer structure

    Allows
      - Keyword-based queries including operators
      - More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
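
A keyword query with operators can be sketched as set tests over the words of each document. The three sample documents and the helper names (`tokens`, `keyword_and`, `keyword_or`) are assumptions made for illustration, not part of the lecture.

```python
import re

# Hypothetical free-text documents.
docs = {
    1: "Drug abuse among teenagers is rising",
    2: "New drug approved for heart disease",
    3: "Reports of substance abuse in sports",
}


def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_and(docs, *terms):
    """IDs of documents containing every query term."""
    return sorted(d for d, text in docs.items()
                  if all(t in tokens(text) for t in terms))


def keyword_or(docs, *terms):
    """IDs of documents containing at least one query term."""
    return sorted(d for d, text in docs.items()
                  if any(t in tokens(text) for t in terms))


print(keyword_and(docs, "drug", "abuse"))  # prints [1]
print(keyword_or(docs, "drug", "abuse"))   # prints [1, 2, 3]
```

Document 3 arguably deals with the same concept but never uses the word "drug", so the AND query misses it. That gap is what the more sophisticated concept queries aim to close.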

  • 8/3/2019 cs707_010312

    Semi-structured data

    In fact almost no data is "unstructured"
      - E.g., this slide has distinctly identified zones such as the Title and Bullets

    Facilitates "semi-structured" search such as
      - Title contains data AND Bullets contain search

    ... to say nothing of linguistic structure
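
The zone query above could be sketched as follows, treating each slide as a record with a `title` zone and a `bullets` zone. The sample slides and the `zone_contains` helper are hypothetical.

```python
import re

# Hypothetical slides, each with two named zones.
slides = [
    {"title": "Semi-structured data",
     "bullets": ["In fact almost no data is unstructured",
                 "Facilitates semi-structured search"]},
    {"title": "Unstructured data",
     "bullets": ["Typically refers to free text"]},
    {"title": "What is IR?",
     "bullets": ["Organization and storage",
                 "Access to information items"]},
]


def zone_contains(doc, zone, term):
    """True if the named zone of the document contains the term as a word."""
    content = doc[zone]
    text = " ".join(content) if isinstance(content, list) else content
    return term.lower() in re.findall(r"[a-z]+", text.lower())


# Title contains "data" AND Bullets contain "search"
hits = [s["title"] for s in slides
        if zone_contains(s, "title", "data")
        and zone_contains(s, "bullets", "search")]
print(hits)  # prints ['Semi-structured data']
```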

  • 8/3/2019 cs707_010312

    What is IR?

    Representation
      - Keywords/Phrases, Structure/Fonts, Counts, etc.

    Organization and Storage
      - Inverted File Index, Compressed, etc.
      - Hardware Architecture and Memory Hierarchy

    Access to information items
      - Interface: Spell-checker to tree-structured display
      - Visualization: Labeled Clusters, Timelines, Spring graphs, etc.
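
As a rough illustration of the inverted file index mentioned under Organization and Storage, the sketch below maps each term to a sorted list of document IDs and answers a two-term AND query by merging two lists. The documents and function names are assumptions; compression and the memory hierarchy are ignored.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Merge two sorted posting lists, keeping IDs present in both."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of information",
}
index = build_inverted_index(docs)
print(index["systems"])                                     # prints [1, 2]
print(intersect(index["information"], index["retrieval"]))  # prints [1, 3]
```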

  • 8/3/2019 cs707_010312

    Ultimate Focus of IR

    Satisfying user information need
      - Emphasis is on retrieval of information (not data)

    User information need: Examples
      - Printer reviews; Printer prices and availability
      - Words in which all vowels appear
      - Anagrams/Permutations of "art"
      - Flight numbers; UPS/FedEx/USPS tracking code

    Predicting which documents are relevant, and then linearly ranking them.
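
Two of the example needs above (words containing every vowel, anagrams of "art") are about the letters of a word rather than its topic, so a plain keyword lookup does not express them. A small sketch, assuming a hypothetical word list:

```python
# Hypothetical vocabulary to search.
words = ["education", "rat", "tar", "art", "sequoia", "retrieval", "cart"]


def has_all_vowels(word):
    """True if the word contains each of a, e, i, o, u at least once."""
    return set("aeiou") <= set(word.lower())


def is_anagram(word, target):
    """True if word is a rearrangement of target (and not target itself)."""
    w, t = word.lower(), target.lower()
    return w != t and sorted(w) == sorted(t)


print([w for w in words if has_all_vowels(w)])     # prints ['education', 'sequoia']
print([w for w in words if is_anagram(w, "art")])  # prints ['rat', 'tar']
```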

  • 8/3/2019 cs707_010312

    Information Need: Query, Relevancy

    An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.

    A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.

  • 8/3/2019 cs707_010312

    DIKW Hierarchy

    Data: Symbolic units
      - E.g., Records of customer.
      - E.g., Bytes from sensors.

    Information: Data with an interpretation (Who?, What?, When?, Where?).
      - E.g., Records of current/new customer grouped by their ages.
      - E.g., Variation in temperature readings.

  • 8/3/2019 cs707_010312

    DIKW Hierarchy

    Knowledge: Information organized with theoretical concepts or abstract ideas (How?)
      - E.g., How many customers have cancelled the accounts in current fiscal year?
      - E.g., Analysis of temperature variation over the years and their causes.

    Wisdom: Understanding of fundamental principles + Human Judgement
      - E.g., What strategies can be employed to retain customers in the face of cheaper alternatives?
      - E.g., Global warming issues and the future of Earth.

  • 8/3/2019 cs707_010312

    DIKW hierarchy: Clark 2004

    [Figure: Data, Information, Knowledge, and Wisdom plotted against Understanding and Context axes. Stage labels: gathering of parts, connection of parts, formation of a whole, joining of wholes. Activity labels: researching, absorbing, doing, interacting, reflecting. Other labels: Past, Future, Experience, Novelty.]

  • 8/3/2019 cs707_010312

    You see things; and you say "Why?" But I dream things that never were; and I say "Why not?"

    George Bernard Shaw

  • 8/3/2019 cs707_010312

    Information vs Data Retrieval

                          Information Retrieval                                        Data Retrieval
    DATA:                 Unstructured: open to interpretation                         Structured with well-defined semantics
    QUERY:                Usually incomplete or ambiguous (w.r.t. information need)    Well-defined semantics
    QUALITY OF RESULTS:   Partial match allowed, relevance-based ranking               Exact match required - no or many results
    FOUNDATIONS:          Probabilistic underpinnings                                  Algebra/Logic
    APPLICATION:          Library                                                      Accounting
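
The contrast in result quality can be sketched with a toy collection. The documents, the query, and the scoring rule are assumptions: counting matched query terms is a deliberately crude stand-in for the probabilistic models the slide alludes to.

```python
# Hypothetical collection and query.
docs = {
    "d1": "cheap laser printer reviews",
    "d2": "printer prices and availability",
    "d3": "flight numbers and tracking",
}
query = ["laser", "printer", "prices"]


def exact_match(docs, query):
    """Data-retrieval style: a document qualifies only if every term is present."""
    return sorted(d for d, text in docs.items()
                  if all(t in text.split() for t in query))


def ranked_partial_match(docs, query):
    """IR style: score by number of query terms present, best first."""
    scores = {d: sum(t in text.split() for t in query)
              for d, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(d, s) for d, s in ranked if s > 0]


print(exact_match(docs, query))           # prints []
print(ranked_partial_match(docs, query))  # prints [('d1', 2), ('d2', 2)]
```

The exact-match query returns nothing, while the partial-match ranking still surfaces the two printer documents.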

  • 8/3/2019 cs707_010312

    User Task

    Retrieval
      - Purposeful
      - HP Multifunction Printer Information

    Browsing
      - Casual
      - Big Bang, CBR, Element Genesis, Supernova, ...
      - Hyperlink-based

    Filtering by Agents
      - Push
      - Podcasts from B.B.C.'s Naked Science

    [Diagram labels: Retrieval, Browsing, Database]

  • 8/3/2019 cs707_010312

    Logical View of Documents

    Abstraction (essentials)
      - Structure, fonts, proximity, repetitions, etc.

    [Diagram: Docs pass through text operations labeled structure; accents, spacing; stopwords; noun groups; stemming; manual indexing. Resulting views: structure, full text, index terms.]
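
A few of the text operations named in the diagram (accent removal, stopword removal, stemming) can be sketched as a small pipeline from full text to index terms. The stopword list and the suffix-stripping `crude_stem` are illustrative assumptions and far simpler than a real stemmer.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(word):
    """Strip a trailing 'ing' or 's' when at least three letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def index_terms(text):
    """Full text -> normalized, stopword-free, stemmed index terms."""
    words = re.findall(r"[a-z]+", strip_accents(text).lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("The cafés are indexing documents"))
# prints ['cafe', 'index', 'document']
```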

  • 8/3/2019 cs707_010312

    The Retrieval Process

    [Diagram: components are User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, and Text Database. Labeled flows: user need, user feedback, text, logical view, query, inverted file, retrieved docs, ranked docs.]

  • 8/3/2019 cs707_010312

    IR Basics

    Models and retrieval evaluation

    Query languages and operations
      - Improve inferring query context (query expansion, relevance feedback)

    Text operations
      - Improve gleaning of document semantics (stemming keywords)

    Efficient Access: Index and Search
      - Visualization, Multimedia, Applications,
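
Query expansion, one of the query operations listed above, can be sketched as adding related terms from a lookup table. The `SYNONYMS` table and the `expand_query` helper are hypothetical; real systems may derive expansions from a thesaurus, from co-occurrence statistics, or from relevance feedback.

```python
# Hypothetical hand-built synonym table.
SYNONYMS = {
    "car": ["automobile"],
    "cheap": ["inexpensive", "budget"],
}


def expand_query(terms, synonyms):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cheap", "car"], SYNONYMS))
# prints ['cheap', 'inexpensive', 'budget', 'car', 'automobile']
```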

  • 8/3/2019 cs707_010312

    Clustering and classification

    Given a set of docs, group them into clusters based on their content.

    Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
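
One very simple way to approach the classification task is to compare the new document's terms with a term set for each topic and pick the closest. The sketch below uses Jaccard similarity; the topic vocabularies are hypothetical, and this is only one of many possible methods, not one the slides prescribe.

```python
# Hypothetical topic vocabularies.
TOPICS = {
    "astronomy": {"star", "galaxy", "supernova", "telescope"},
    "printing": {"printer", "ink", "toner", "paper"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(doc_terms, topics):
    """Return the name of the topic whose term set is most similar to the doc."""
    scores = {name: jaccard(doc_terms, terms) for name, terms in topics.items()}
    return max(scores, key=scores.get)


new_doc = {"supernova", "star", "explosion"}
print(classify(new_doc, TOPICS))  # prints astronomy
```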

  • 8/3/2019 cs707_010312

    The web and its challenges

    Unusual and diverse documents

    Unusual and diverse users, queries, information needs

    Beyond terms, exploit ideas from social networks
      - link analysis, clickstreams, ...

    How do search engines work? And how can we make them better?

  • 8/3/2019 cs707_010312

    More sophisticated semi-structured search

    Title is about Object Oriented Programming AND Author something like stro*rup
      - where * is the wild-card operator

    Issues:
      - how do you process "about"?
      - how do you rank results?

    The focus of XML search.
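
The wild-card part of the query can be sketched with Python's standard `fnmatch` module, whose `*` matches any run of characters. The author list is hypothetical, and the harder "is about" condition is not addressed here.

```python
from fnmatch import fnmatchcase

# Hypothetical author names to search.
authors = ["Stroustrup", "Strohrup", "Knuth", "Armstrong"]


def wildcard_match(pattern, names):
    """Return the names matching the pattern, ignoring letter case."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


print(wildcard_match("stro*rup", authors))  # prints ['Stroustrup', 'Strohrup']
```

The pattern must match the whole name, so "Armstrong" is rejected even though it contains "stro".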

  • 8/3/2019 cs707_010312

    More sophisticated information retrieval

    Cross-language information retrieval

    Question answering

    Summarization

    Text mining

  • 8/3/2019 cs707_010312

    Future Progress: Factors/Trends

    Large, uncontrolled publishing media
      - Quality issues

    Cheap, fast and wide access
      - Ease of use (query formulation)

    Variety and flexibility
      - Navigational and Visualization aids
      - Directory-based (Table of contents) vs Keywords-based (Inverted File Index)

    Index terms (automatic/human-created) vs Full-text

    Privacy, Security, Copyright