IR1 Introduction

  • 8/2/2019 IR1 Introduction

    Salah Hammami

    2008

    Lecture 1: Introduction

    CSC483: Information Retrieval

  • Course Goals

    dimensions of the IR "problem"

    functions of an IR system

    components of an IR system

    factors which optimize the IR process

    examine current research issues in IR

    explore examples of industrial IR applications

    form a broad picture of the IR field

  • Course Material

    Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto, 2001

    Information Retrieval: A Survey, by Ed Greengrass, 2000

    Information Discovery, by Theo van der Weide, 2001

    Introduction to Information Retrieval, by C. D. Manning, P. Raghavan and H. Schütze (in preparation)

    Lecture slides & notes

    Additional study material & Web links

  • Course Lectures Overview

    IR introduction

    IR research issues

    Applications of IR

    1. IR Models

    2. IR Query Languages & Operations

    3. Searcher Feedback

    4. Language Modeling for IR

    5. Search Engines

    6. Semantics in IR

    8. Multimedia IR

    9. Structured Content

    http://tech.groups.yahoo.com/group/csc483/

  • IR Related Areas

    Database Management

    Library and Information Science

    Artificial Intelligence

    Natural Language Processing

    Machine Learning

  • IR Related Areas

    Database Management

      structured data in relational tables vs. free-form text

      well-defined queries in a formal language (SQL)

      Recent move towards semi-structured data (XML)

    Library and Information Science

      user aspects of IR

      categorization of human knowledge

      citation analysis

      bibliometrics (structure of information)

      Recent work on digital libraries

    Artificial Intelligence (AI)

      Knowledge representation, reasoning, formalisms, e.g. first-order predicate logic, Bayesian networks

      Recent work on Web ontologies and Intelligent Information Agents

  • IR Related Areas

    Natural Language Processing (NLP)

      Syntactic, semantic, and pragmatic analysis of text & discourse

      Retrieval based on meaning rather than keywords

      Analyzing the syntax (phrase structure) and semantics

      Determining the sense of ambiguous words (context-based)

      Identifying specific pieces of information in a document

      Answering specific NL questions

      Recent work in GATE (General Architecture for Text Engineering, http://gate.ac.uk/)

    Machine Learning (ML)

      computational systems that improve their performance with experience

      automated classification of examples (supervised learning)

      automated clustering of examples (unsupervised learning)

  • IR is not databases

  • The Information Age

    Current situation: increasing amount of information, various information, distributed information repositories, dynamic user demands, complex user goals

    [Diagram labels: understand, manage, customization, demand, speed, precision]

    Main task: Information retrieval

  • Information Retrieval: Definition

    Information Retrieval (IR) is the task of finding relevant texts within a large amount of unstructured data

    Relevant = texts matching some specific criteria.

    Examples of IR tasks: searching for emails from a given person, searching for an event that occurred on a given date using the Internet, etc.

    Examples of IR systems: WWW search engines, specific search engines (laws, medical documents), etc.

    NB: Database Management Systems (DBMS) are different from IR systems (data stored in a DB are structured!)

  • Information Retrieval

    The goal of IR is to retrieve all and only the relevant documents in a collection for a particular user with a particular need for information

    Relevance is a central concept in IR theory

    How does an IR system work when the collection is all documents available on the Web?

    Web search engines are stress-testing the traditional IR models

  • Information Retrieval

    The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user's information need

    Examples are:

      Internet search engines

      Digital library catalogues

    Some application areas within IR

      Cross language retrieval

      Speech/broadcast retrieval

      Text categorization

      Text summarization

    Subject to objective testing and evaluation

      hundreds of queries

      millions of documents (the TREC set and conference)

  • Information Retrieval

    IR in general ...

    IR is the discipline that deals with:

      retrieval

      representation

      storage

      organization

      access

    of structured, semi-structured and unstructured data (information objects)

    in response to a query (topic statement)

      structured (e.g. boolean expression)

      unstructured (e.g. sentence, document)

  • Information Retrieval

    in other words

    The process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query

    Efficiency with respect to:

      algorithms

      query building

      data organization/structure

  • Information Retrieval

    and in other words

    [Diagram relating Data, Algorithm, Query and the content model (CM). Labels: how to organize, what structures, what data, optimal, what CM attributes, how to build, what attributes, what structure, what rules]

  • Data vs. Information Retrieval

    Data retrieval

      which docs contain a set of keywords?

      well-defined semantics

      a single erroneous object implies failure!

    Information retrieval

      information about a subject or topic

      semantics is frequently loose

      small errors are tolerated

    An IR system must

      interpret the contents of information items

      generate a ranking which reflects relevance

      the notion of relevance is most important
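The contrast between data retrieval and information retrieval can be made concrete with a small sketch. The toy collection, the function names and the overlap-count scoring rule below are all illustrative assumptions, not something defined in the lecture: exact matching returns only documents that contain every keyword, while the IR-style function returns a ranked list and tolerates partial matches.

```python
# Toy collection; documents, names and scoring rule are illustrative assumptions.
DOCS = {
    "d1": "relational database query language",
    "d2": "ranking documents by relevance to a query",
    "d3": "web search engines rank documents",
}


def data_retrieval(keywords, docs):
    """Exact matching: ids of the documents that contain ALL keywords."""
    wanted = set(keywords)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(text.split()))


def information_retrieval(keywords, docs):
    """Best matching: rank documents by the number of keywords they contain."""
    wanted = set(keywords)
    scored = [(doc_id, len(wanted & set(text.split())))
              for doc_id, text in docs.items()]
    # Keep partial matches, drop documents sharing nothing with the query.
    scored = [pair for pair in scored if pair[1] > 0]
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


query = ["ranking", "documents", "query"]
print(data_retrieval(query, DOCS))          # only documents with every keyword
print(information_retrieval(query, DOCS))   # ranked list with partial matches
```

With a query such as `["ranking", "web"]`, the exact matcher returns nothing at all, whereas the ranked variant still offers the two partially matching documents.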

  • IR Systems

    [Diagram: the User sends a Query to the IR System and receives a Ranked list of documents]

    interpret contents of information objects

    generate a ranking which reflects relevance

  • IR Systems

    IR System

      disclosure for a collection O of n information objects

      user is interested in information objects

      interest model as a partial order on the collection

        a set of relevant documents

        a set of irrelevant documents

      produces a (total) ordering resembling the user's interest

      comparative model to user info need

    Information Need

      The Information Need has qualitative and quantitative aspects

      expressed in a query

  • Basic Concepts

  • Basic Concepts: The User Task

    Pull actions

      User requests information in an interactive manner

    Push actions

      Software agents push the information towards the users

  • Basic Concepts: The User Task

    Document

      single unit of information

      typically text in a digital form; also other media

      a complete logical unit (e.g. book, article)

      a part of a larger text (e.g. passage, section, entry in a dictionary)

      any physical unit (e.g. file, email, web page)

  • The Standard Retrieval Interaction Model

  • Standard Model of IR

    Assumptions:

    The goal is maximizing precision and recall simultaneously

    The information need remains static

    The value is in the resulting document set
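A minimal sketch of the two measures named in the first assumption, using their usual definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document ids and the relevance judgements are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgements: four of the six documents are relevant.
RELEVANT = {"d1", "d2", "d3", "d4"}
COLLECTION = ["d1", "d2", "d3", "d4", "d5", "d6"]

print(precision_recall(["d1", "d2", "d5"], RELEVANT))  # short result list
print(precision_recall(COLLECTION, RELEVANT))          # return everything
```

Returning the whole collection drives recall to 1.0 but lowers precision, which is why maximizing both at once is stated as a goal rather than taken for granted.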

  • Problems with Standard Model

    Users learn during the search process:

      Scanning titles of retrieved documents

      Reading retrieved documents

      Viewing lists of related topics/thesaurus terms

      Navigating hyperlinks

    Some users don't like long (apparently) disorganized lists of documents

  • IR is an Iterative Process

    [Diagram labels: Repositories, Workspace, Goals]

  • IR is a Dialog

    The exchange doesn't end with the first answer

    Users can recognize elements of a useful answer, even when incomplete

    Questions and understanding change as the process continues

  • Information Retrieval

    Revised Task Statement:

    Build a system that retrieves documents that users are likely to find relevant to their queries

    This set of assumptions underlies the field of Information Retrieval

  • The Retrieval Process ...

    [Diagram of the retrieval process, annotated with numbered steps 1 to 10. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Labels: text defines logical view, specifies user need, logical view, builds, inverted file, query generated, retrieved docs, ranking docs, user feedback change the query]
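The components named in the retrieval process diagram (text operations, indexing into an inverted file, searching, ranking) can be sketched in a few lines. Everything here is an illustrative assumption rather than the lecture's implementation: the stopword list, the three toy documents, and the ranking rule (number of distinct query terms matched).

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}  # assumed list


def text_operations(text):
    """Reduce raw text to a logical view: lowercase tokens minus stopwords."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


def build_inverted_file(docs):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searching and ranking: order documents by distinct query terms matched."""
    scores = defaultdict(int)
    for term in set(text_operations(query)):   # same text operations as indexing
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


DOCS = {
    1: "The retrieval of information.",
    2: "Indexing and searching text",
    3: "Ranking retrieved text in an index",
}
INDEX = build_inverted_file(DOCS)
print(search(INDEX, "searching the text"))
```

Applying the same text operations to documents and queries is what lets the two logical views be matched against each other; user feedback would then modify the query and repeat the search.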

  • The Retrieval Process ...

    [Diagram labels: Documents, Information Need, Index Terms, document, query, match, ranking]

  • The Retrieval Process ...

    Matching index terms is quite imprecise

    Users are frequently unsatisfied

    Users have no training in query formation

    Frequent dissatisfaction of Web users

    Relevance is critical for IR systems: ranking

    Ordering retrieved documents reflects their relevance to the user query

    Fundamental premises for relevance:

      common sets of index terms

      sharing of weighted terms

      likelihood of relevance

    Each set of premises leads to a distinct IR model
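As a hedged illustration of the second premise, sharing of weighted terms, the sketch below ranks documents by the cosine of the angle between term-weight vectors, which is the idea behind the vector model. The weights are raw term frequencies, chosen only to keep the example short (practical systems commonly use tf-idf weights), and the documents are invented.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())  # missing terms count 0
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Order documents by decreasing similarity to the query."""
    q = tf_vector(query)
    scored = [(doc_id, cosine(q, tf_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


DOCS = {
    "d1": "information retrieval",
    "d2": "information theory",
    "d3": "database systems",
}
for doc_id, score in rank("information retrieval", DOCS):
    print(doc_id, round(score, 3))
```

A document sharing only some of the query's terms still receives a non-zero score, so the output is a graded ranking rather than a yes/no match.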

  • IR Taxonomy

    User Task

      Retrieval: Ad hoc, Filtering

      Browsing

    Classic Models: Boolean, Vector, Probabilistic

    Set Theoretic: Fuzzy, Extended Boolean

    Algebraic: Generalized Vector, Latent Semantic Index, Neural Networks

    Probabilistic: Inference Network, Belief Network

    Structured Models: Non-Overlapping Lists, Proximal Nodes

    Browsing: Flat, Structure Guided, Hypertext

  • IR Taxonomy: Ad Hoc Retrieval

    [Diagram: queries Q1 to Q5 are submitted to a Collection of "Fixed Size"]

    collection remains relatively static

    new queries are submitted to the system

    a person having a need for information

    a set of information objects to satisfy the need

    models to formalize the information need

    stable (fixed) info collection

    user interest is valid during some period of time

    query only expresses the information need at some point in time

  • IR Taxonomy: Filtering

    [Diagram: a Documents Stream is compared with the User 1 Profile and the User 2 Profile, producing Docs for User 1 and Docs Filtered for User 2]

    Queries remain relatively static

    New documents come into the system

    (continuous) stream of documents

      e.g. newsgroups

    decision for each document

    no preprocessing of all documents
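The filtering task can be sketched as a loop over an incoming stream in which each document is judged on arrival against static user profiles, with no index built over the whole collection. The profiles, the sample stream and the shared-term threshold rule are hypothetical choices made for this illustration.

```python
def matches(profile, document, threshold=1):
    """Decide for ONE document: does it share at least `threshold` profile terms?"""
    words = set(document.lower().split())
    return len(profile & words) >= threshold


def filter_stream(stream, profiles):
    """Yield (user, document) pairs, deciding on each document as it arrives."""
    for document in stream:                      # documents are seen once, in order
        for user, profile in profiles.items():   # profiles play the role of static queries
            if matches(profile, document):
                yield user, document


# Hypothetical user profiles expressed as sets of terms.
PROFILES = {
    "user1": {"retrieval", "indexing"},
    "user2": {"football"},
}


def sample_stream():
    """Stand-in for a continuous source such as a newsgroup feed."""
    yield "New indexing method announced"
    yield "Football results"
    yield "Weather report"


for user, document in filter_stream(sample_stream(), PROFILES):
    print(user, "<-", document)
```

Because `filter_stream` is a generator over another generator, it never needs the full set of documents in memory, which mirrors the "no preprocessing of all documents" point.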